Data Cleaning¶

Halloween Candy¶

Overview and End Goal¶

Our end goal for this project is to clean the data so that we could then create a machine learning model. We want to see if we are able to predict a person's gender based purely on their candy preferences. Although, you will not be creating a model for this assignment, only cleaning the data. The results of the models that I used after cleaning the data are provided at the end of this notebook.

Initial Import & Exploration¶

In [1]:
# initial imports
import pandas as pd
import numpy as np

# Do not change this option; This allows the CodeGrade auto grading to function correctly
pd.set_option('display.max_columns', 20)

Let's start by importing our data and creating a DataFrame called candy. We need to include encoding='iso-8859-1' during the import because there are special characters in the data that Pandas doesn't recognize. This happens a lot when attempting to import data where the public is able to input answers, especially if there are foreign language characters included. The normal encoding for Pandas is utf-8, so changing the encoding allows Pandas to recognize those special characters.

Run the following code, with the encoding argument, and it should import correctly.

In [2]:
# read_csv with iso-8859-1 encoding; using latin-1 would also work here
candy_full = pd.read_csv('candy.csv', encoding='iso-8859-1')

# copy to new DF so that we can have a copy of the original import if needed
candy = candy_full.copy()
candy
Out[2]:
Internal ID Q1: GOING OUT? Q2: GENDER Q3: AGE Q4: COUNTRY Q5: STATE, PROVINCE, COUNTY, ETC Q6 | 100 Grand Bar Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) Q6 | Any full-sized candy bar Q6 | Black Jacks ... Q8: DESPAIR OTHER Q9: OTHER COMMENTS Q10: DRESS Unnamed: 113 Q11: DAY Q12: MEDIA [Daily Dish] Q12: MEDIA [Science] Q12: MEDIA [ESPN] Q12: MEDIA [Yahoo] Click Coordinates (x, y)
0 90258773 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 90272821 No Male 44 USA NM MEH DESPAIR JOY MEH ... NaN Bottom line is Twix is really the only candy w... White and gold NaN Sunday NaN 1.0 NaN NaN (84, 25)
2 90272829 NaN Male 49 USA Virginia NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 90272840 No Male 40 us or MEH DESPAIR JOY MEH ... NaN Raisins can go to hell White and gold NaN Sunday NaN 1.0 NaN NaN (75, 23)
4 90272841 No Male 23 usa exton pa JOY DESPAIR JOY DESPAIR ... NaN NaN White and gold NaN Friday NaN 1.0 NaN NaN (70, 10)
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2474 90314359 No Male 24 USA MD JOY DESPAIR MEH DESPAIR ... Fruit Stripe Gum NaN White and gold NaN Friday NaN NaN NaN NaN NaN
2475 90314580 No Female 33 USA New York MEH DESPAIR JOY NaN ... Capers NaN Blue and black NaN Friday NaN 1.0 NaN NaN (70, 26)
2476 90314634 No Female 26 USA Tennessee MEH DESPAIR JOY DESPAIR ... NaN NaN Blue and black NaN Friday NaN 1.0 NaN NaN (67, 35)
2477 90314658 No Male 58 Usa North Carolina NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2478 90314802 No Female 66 usa Pennsylvania DESPAIR DESPAIR JOY DESPAIR ... NaN You hit all my chocolate highlights, and broug... White and gold NaN Sunday 1.0 NaN NaN NaN (19, 26)

2479 rows × 120 columns

Let's take a brief look at the data by using head().

In [3]:
candy.shape
Out[3]:
(2479, 120)
In [4]:
# first five rows
candy.head()
Out[4]:
Internal ID Q1: GOING OUT? Q2: GENDER Q3: AGE Q4: COUNTRY Q5: STATE, PROVINCE, COUNTY, ETC Q6 | 100 Grand Bar Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) Q6 | Any full-sized candy bar Q6 | Black Jacks ... Q8: DESPAIR OTHER Q9: OTHER COMMENTS Q10: DRESS Unnamed: 113 Q11: DAY Q12: MEDIA [Daily Dish] Q12: MEDIA [Science] Q12: MEDIA [ESPN] Q12: MEDIA [Yahoo] Click Coordinates (x, y)
0 90258773 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 90272821 No Male 44 USA NM MEH DESPAIR JOY MEH ... NaN Bottom line is Twix is really the only candy w... White and gold NaN Sunday NaN 1.0 NaN NaN (84, 25)
2 90272829 NaN Male 49 USA Virginia NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 90272840 No Male 40 us or MEH DESPAIR JOY MEH ... NaN Raisins can go to hell White and gold NaN Sunday NaN 1.0 NaN NaN (75, 23)
4 90272841 No Male 23 usa exton pa JOY DESPAIR JOY DESPAIR ... NaN NaN White and gold NaN Friday NaN 1.0 NaN NaN (70, 10)

5 rows × 120 columns

Next, run the following code to see information about the DataFrame.

In [5]:
# check info about the DataFrame
candy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2479 entries, 0 to 2478
Columns: 120 entries, Internal ID to Click Coordinates (x, y)
dtypes: float64(4), int64(1), object(115)
memory usage: 2.3+ MB

Notice that this did not print the columns as you might be used to seeing. According to the Pandas documentation: "If the DataFrame has more than max_cols columns, the truncated output is used. By default, the setting in pandas.options.display.max_info_columns is used."

We can make the columns display by setting the max_cols argument equal to the number of columns in the data set.

In [6]:
candy.shape
Out[6]:
(2479, 120)
In [7]:
# check info, set max_cols
candy.info(max_cols=120)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2479 entries, 0 to 2478
Data columns (total 120 columns):
 #    Column                                                                                 Non-Null Count  Dtype  
---   ------                                                                                 --------------  -----  
 0    Internal ID                                                                            2479 non-null   int64  
 1    Q1: GOING OUT?                                                                         2368 non-null   object 
 2    Q2: GENDER                                                                             2437 non-null   object 
 3    Q3: AGE                                                                                2394 non-null   object 
 4    Q4: COUNTRY                                                                            2414 non-null   object 
 5    Q5: STATE, PROVINCE, COUNTY, ETC                                                       2377 non-null   object 
 6    Q6 | 100 Grand Bar                                                                     1728 non-null   object 
 7    Q6 | Anonymous brown globs that come in black and orange wrappers	(a.k.a. Mary Janes)  1741 non-null   object 
 8    Q6 | Any full-sized candy bar                                                          1803 non-null   object 
 9    Q6 | Black Jacks                                                                       1517 non-null   object 
 10   Q6 | Bonkers (the candy)                                                               1482 non-null   object 
 11   Q6 | Bonkers (the board game)                                                          1469 non-null   object 
 12   Q6 | Bottle Caps                                                                       1709 non-null   object 
 13   Q6 | Box'o'Raisins                                                                     1787 non-null   object 
 14   Q6 | Broken glow stick                                                                 1767 non-null   object 
 15   Q6 | Butterfinger                                                                      1793 non-null   object 
 16   Q6 | Cadbury Creme Eggs                                                                1792 non-null   object 
 17   Q6 | Candy Corn                                                                        1796 non-null   object 
 18   Q6 | Candy that is clearly just the stuff given out for free at restaurants            1784 non-null   object 
 19   Q6 | Caramellos                                                                        1723 non-null   object 
 20   Q6 | Cash, or other forms of legal tender                                              1795 non-null   object 
 21   Q6 | Chardonnay                                                                        1732 non-null   object 
 22   Q6 | Chick-o-Sticks (we donÕt know what that is)                                       1528 non-null   object 
 23   Q6 | Chiclets                                                                          1764 non-null   object 
 24   Q6 | Coffee Crisp                                                                      1621 non-null   object 
 25   Q6 | Creepy Religious comics/Chick Tracts                                              1771 non-null   object 
 26   Q6 | Dental paraphenalia                                                               1783 non-null   object 
 27   Q6 | Dots                                                                              1746 non-null   object 
 28   Q6 | Dove Bars                                                                         1773 non-null   object 
 29   Q6 | Fuzzy Peaches                                                                     1652 non-null   object 
 30   Q6 | Generic Brand Acetaminophen                                                       1744 non-null   object 
 31   Q6 | Glow sticks                                                                       1778 non-null   object 
 32   Q6 | Goo Goo Clusters                                                                  1595 non-null   object 
 33   Q6 | Good N' Plenty                                                                    1741 non-null   object 
 34   Q6 | Gum from baseball cards                                                           1759 non-null   object 
 35   Q6 | Gummy Bears straight up                                                           1778 non-null   object 
 36   Q6 | Hard Candy                                                                        1780 non-null   object 
 37   Q6 | Healthy Fruit                                                                     1781 non-null   object 
 38   Q6 | Heath Bar                                                                         1763 non-null   object 
 39   Q6 | Hershey's Dark Chocolate                                                          1802 non-null   object 
 40   Q6 | HersheyÕs Milk Chocolate                                                          1803 non-null   object 
 41   Q6 | Hershey's Kisses                                                                  1797 non-null   object 
 42   Q6 | Hugs (actual physical hugs)                                                       1762 non-null   object 
 43   Q6 | Jolly Rancher (bad flavor)                                                        1781 non-null   object 
 44   Q6 | Jolly Ranchers (good flavor)                                                      1780 non-null   object 
 45   Q6 | JoyJoy (Mit Iodine!)                                                              1449 non-null   object 
 46   Q6 | Junior Mints                                                                      1777 non-null   object 
 47   Q6 | Senior Mints                                                                      1532 non-null   object 
 48   Q6 | Kale smoothie                                                                     1731 non-null   object 
 49   Q6 | Kinder Happy Hippo                                                                1528 non-null   object 
 50   Q6 | Kit Kat                                                                           1801 non-null   object 
 51   Q6 | LaffyTaffy                                                                        1739 non-null   object 
 52   Q6 | LemonHeads                                                                        1745 non-null   object 
 53   Q6 | Licorice (not black)                                                              1789 non-null   object 
 54   Q6 | Licorice (yes black)                                                              1790 non-null   object 
 55   Q6 | Lindt Truffle                                                                     1757 non-null   object 
 56   Q6 | Lollipops                                                                         1784 non-null   object 
 57   Q6 | Mars                                                                              1750 non-null   object 
 58   Q6 | Maynards                                                                          1450 non-null   object 
 59   Q6 | Mike and Ike                                                                      1746 non-null   object 
 60   Q6 | Milk Duds                                                                         1782 non-null   object 
 61   Q6 | Milky Way                                                                         1787 non-null   object 
 62   Q6 | Regular M&Ms                                                                      1800 non-null   object 
 63   Q6 | Peanut M&MÕs                                                                      1804 non-null   object 
 64   Q6 | Blue M&M's                                                                        1748 non-null   object 
 65   Q6 | Red M&M's                                                                         1746 non-null   object 
 66   Q6 | Green Party M&M's                                                                 1711 non-null   object 
 67   Q6 | Independent M&M's                                                                 1662 non-null   object 
 68   Q6 | Abstained from M&M'ing.                                                           1532 non-null   object 
 69   Q6 | Minibags of chips                                                                 1751 non-null   object 
 70   Q6 | Mint Kisses                                                                       1699 non-null   object 
 71   Q6 | Mint Juleps                                                                       1664 non-null   object 
 72   Q6 | Mr. Goodbar                                                                       1735 non-null   object 
 73   Q6 | Necco Wafers                                                                      1731 non-null   object 
 74   Q6 | Nerds                                                                             1752 non-null   object 
 75   Q6 | Nestle Crunch                                                                     1777 non-null   object 
 76   Q6 | Now'n'Laters                                                                      1657 non-null   object 
 77   Q6 | Peeps                                                                             1765 non-null   object 
 78   Q6 | Pencils                                                                           1766 non-null   object 
 79   Q6 | Pixy Stix                                                                         1753 non-null   object 
 80   Q6 | Real Housewives of Orange County Season 9 Blue-Ray                                1722 non-null   object 
 81   Q6 | ReeseÕs Peanut Butter Cups                                                        1796 non-null   object 
 82   Q6 | Reese's Pieces                                                                    1784 non-null   object 
 83   Q6 | Reggie Jackson Bar                                                                1460 non-null   object 
 84   Q6 | Rolos                                                                             1761 non-null   object 
 85   Q6 | Sandwich-sized bags filled with BooBerry Crunch                                   1699 non-null   object 
 86   Q6 | Skittles                                                                          1769 non-null   object 
 87   Q6 | Smarties (American)                                                               1750 non-null   object 
 88   Q6 | Smarties (Commonwealth)                                                           1573 non-null   object 
 89   Q6 | Snickers                                                                          1785 non-null   object 
 90   Q6 | Sourpatch Kids (i.e. abominations of nature)                                      1737 non-null   object 
 91   Q6 | Spotted Dick                                                                      1593 non-null   object 
 92   Q6 | Starburst                                                                         1782 non-null   object 
 93   Q6 | Sweet Tarts                                                                       1767 non-null   object 
 94   Q6 | Swedish Fish                                                                      1760 non-null   object 
 95   Q6 | Sweetums (a friend to diabetes)                                                   1472 non-null   object 
 96   Q6 | Take 5                                                                            1557 non-null   object 
 97   Q6 | Tic Tacs                                                                          1761 non-null   object 
 98   Q6 | Those odd marshmallow circus peanut things                                        1739 non-null   object 
 99   Q6 | Three Musketeers                                                                  1767 non-null   object 
 100  Q6 | Tolberone something or other                                                      1769 non-null   object 
 101  Q6 | Trail Mix                                                                         1767 non-null   object 
 102  Q6 | Twix                                                                              1785 non-null   object 
 103  Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein            1683 non-null   object 
 104  Q6 | Vicodin                                                                           1686 non-null   object 
 105  Q6 | Whatchamacallit Bars                                                              1652 non-null   object 
 106  Q6 | White Bread                                                                       1718 non-null   object 
 107  Q6 | Whole Wheat anything                                                              1728 non-null   object 
 108  Q6 | York Peppermint Patties                                                           1770 non-null   object 
 109  Q7: JOY OTHER                                                                          917 non-null    object 
 110  Q8: DESPAIR OTHER                                                                      723 non-null    object 
 111  Q9: OTHER COMMENTS                                                                     389 non-null    object 
 112  Q10: DRESS                                                                             1730 non-null   object 
 113  Unnamed: 113                                                                           9 non-null      object 
 114  Q11: DAY                                                                               1750 non-null   object 
 115  Q12: MEDIA [Daily Dish]                                                                85 non-null     float64
 116  Q12: MEDIA [Science]                                                                   1376 non-null   float64
 117  Q12: MEDIA [ESPN]                                                                      99 non-null     float64
 118  Q12: MEDIA [Yahoo]                                                                     67 non-null     float64
 119  Click Coordinates (x, y)                                                               1619 non-null   object 
dtypes: float64(4), int64(1), object(115)
memory usage: 2.3+ MB

Of course, if you are just looking for the column names, you can just use a simple for loop.

In [8]:
# print a list of column names
for col in candy.columns:
    print(col)
    
Internal ID
Q1: GOING OUT?
Q2: GENDER
Q3: AGE
Q4: COUNTRY
Q5: STATE, PROVINCE, COUNTY, ETC
Q6 | 100 Grand Bar
Q6 | Anonymous brown globs that come in black and orange wrappers	(a.k.a. Mary Janes)
Q6 | Any full-sized candy bar
Q6 | Black Jacks
Q6 | Bonkers (the candy)
Q6 | Bonkers (the board game)
Q6 | Bottle Caps
Q6 | Box'o'Raisins
Q6 | Broken glow stick
Q6 | Butterfinger
Q6 | Cadbury Creme Eggs
Q6 | Candy Corn
Q6 | Candy that is clearly just the stuff given out for free at restaurants
Q6 | Caramellos
Q6 | Cash, or other forms of legal tender
Q6 | Chardonnay
Q6 | Chick-o-Sticks (we donÕt know what that is)
Q6 | Chiclets
Q6 | Coffee Crisp
Q6 | Creepy Religious comics/Chick Tracts
Q6 | Dental paraphenalia
Q6 | Dots
Q6 | Dove Bars
Q6 | Fuzzy Peaches
Q6 | Generic Brand Acetaminophen
Q6 | Glow sticks
Q6 | Goo Goo Clusters
Q6 | Good N' Plenty
Q6 | Gum from baseball cards
Q6 | Gummy Bears straight up
Q6 | Hard Candy
Q6 | Healthy Fruit
Q6 | Heath Bar
Q6 | Hershey's Dark Chocolate
Q6 | HersheyÕs Milk Chocolate
Q6 | Hershey's Kisses
Q6 | Hugs (actual physical hugs)
Q6 | Jolly Rancher (bad flavor)
Q6 | Jolly Ranchers (good flavor)
Q6 | JoyJoy (Mit Iodine!)
Q6 | Junior Mints
Q6 | Senior Mints
Q6 | Kale smoothie
Q6 | Kinder Happy Hippo
Q6 | Kit Kat
Q6 | LaffyTaffy
Q6 | LemonHeads
Q6 | Licorice (not black)
Q6 | Licorice (yes black)
Q6 | Lindt Truffle
Q6 | Lollipops
Q6 | Mars
Q6 | Maynards
Q6 | Mike and Ike
Q6 | Milk Duds
Q6 | Milky Way
Q6 | Regular M&Ms
Q6 | Peanut M&MÕs
Q6 | Blue M&M's
Q6 | Red M&M's
Q6 | Green Party M&M's
Q6 | Independent M&M's
Q6 | Abstained from M&M'ing.
Q6 | Minibags of chips
Q6 | Mint Kisses
Q6 | Mint Juleps
Q6 | Mr. Goodbar
Q6 | Necco Wafers
Q6 | Nerds
Q6 | Nestle Crunch
Q6 | Now'n'Laters
Q6 | Peeps
Q6 | Pencils
Q6 | Pixy Stix
Q6 | Real Housewives of Orange County Season 9 Blue-Ray
Q6 | ReeseÕs Peanut Butter Cups
Q6 | Reese's Pieces
Q6 | Reggie Jackson Bar
Q6 | Rolos
Q6 | Sandwich-sized bags filled with BooBerry Crunch
Q6 | Skittles
Q6 | Smarties (American)
Q6 | Smarties (Commonwealth)
Q6 | Snickers
Q6 | Sourpatch Kids (i.e. abominations of nature)
Q6 | Spotted Dick
Q6 | Starburst
Q6 | Sweet Tarts
Q6 | Swedish Fish
Q6 | Sweetums (a friend to diabetes)
Q6 | Take 5
Q6 | Tic Tacs
Q6 | Those odd marshmallow circus peanut things
Q6 | Three Musketeers
Q6 | Tolberone something or other
Q6 | Trail Mix
Q6 | Twix
Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein
Q6 | Vicodin
Q6 | Whatchamacallit Bars
Q6 | White Bread
Q6 | Whole Wheat anything
Q6 | York Peppermint Patties
Q7: JOY OTHER
Q8: DESPAIR OTHER
Q9: OTHER COMMENTS
Q10: DRESS
Unnamed: 113
Q11: DAY
Q12: MEDIA [Daily Dish]
Q12: MEDIA [Science]
Q12: MEDIA [ESPN]
Q12: MEDIA [Yahoo]
Click Coordinates (x, y)

This data set is pretty messy. Your goal is now to perform the following actions to get it to the point where it can be passed to a machine learning model.

Note: Unless the instructions ask you to do something different, please always update the original candy DataFrame for the exercises below. The automatic grading in CodeGrade will check your final DataFrame and ensure that you have performed all required data manipulations. Also, feel free to add additional cells as needed.

Data Cleaning¶

Exercise_A: Taking a look at the column names, you may notice that some include the character Õ. This should instead be an apostrophe ' mark. Rename the column names that include the Õ character and replace it was an apostrophe.

Remember that you should be updating the candy DataFrame for the tasks listed as "Exercises" unless told differently.

In [9]:
# Identify the columns with special character Õ 
 #We find one column under Q6 | HersheyÕs Milk Chocolate

candy.filter(like="col").select_dtypes("object").columns

#This gives us the following, before we replace

#Index(['Q6 | Hershey's Dark Chocolate', 'Q6 | HersheyÕs Milk Chocolate'], dtype='object')
Out[9]:
Index(['Q6 | Hershey's Dark Chocolate', 'Q6 | HersheyÕs Milk Chocolate'], dtype='object')
In [10]:
# Now we rename Q6 | HersheyÕs Milk Chocolate column as Q6_new | 'Hershey's Milk Chocolate'
# And in a way we replace now the special character Õ to an apostrophe

candy = candy.rename(columns={"Q6 | HersheyÕs Milk Chocolate": "Q6_new | Hershey's Milk Chocolate"})  ### ENTER CODE HERE ###
candy
Out[10]:
Internal ID Q1: GOING OUT? Q2: GENDER Q3: AGE Q4: COUNTRY Q5: STATE, PROVINCE, COUNTY, ETC Q6 | 100 Grand Bar Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) Q6 | Any full-sized candy bar Q6 | Black Jacks ... Q8: DESPAIR OTHER Q9: OTHER COMMENTS Q10: DRESS Unnamed: 113 Q11: DAY Q12: MEDIA [Daily Dish] Q12: MEDIA [Science] Q12: MEDIA [ESPN] Q12: MEDIA [Yahoo] Click Coordinates (x, y)
0 90258773 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 90272821 No Male 44 USA NM MEH DESPAIR JOY MEH ... NaN Bottom line is Twix is really the only candy w... White and gold NaN Sunday NaN 1.0 NaN NaN (84, 25)
2 90272829 NaN Male 49 USA Virginia NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 90272840 No Male 40 us or MEH DESPAIR JOY MEH ... NaN Raisins can go to hell White and gold NaN Sunday NaN 1.0 NaN NaN (75, 23)
4 90272841 No Male 23 usa exton pa JOY DESPAIR JOY DESPAIR ... NaN NaN White and gold NaN Friday NaN 1.0 NaN NaN (70, 10)
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2474 90314359 No Male 24 USA MD JOY DESPAIR MEH DESPAIR ... Fruit Stripe Gum NaN White and gold NaN Friday NaN NaN NaN NaN NaN
2475 90314580 No Female 33 USA New York MEH DESPAIR JOY NaN ... Capers NaN Blue and black NaN Friday NaN 1.0 NaN NaN (70, 26)
2476 90314634 No Female 26 USA Tennessee MEH DESPAIR JOY DESPAIR ... NaN NaN Blue and black NaN Friday NaN 1.0 NaN NaN (67, 35)
2477 90314658 No Male 58 Usa North Carolina NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2478 90314802 No Female 66 usa Pennsylvania DESPAIR DESPAIR JOY DESPAIR ... NaN You hit all my chocolate highlights, and broug... White and gold NaN Sunday 1.0 NaN NaN NaN (19, 26)

2479 rows × 120 columns

In [11]:
#Check 1: We get the following change, after renaming and replacing 

candy.filter(like="col").select_dtypes("object").columns
Out[11]:
Index(['Q6 | Hershey's Dark Chocolate', 'Q6_new | Hershey's Milk Chocolate'], dtype='object')
In [12]:
#Check#2 Looking into this dataset
candy["Q6 | Hershey's Dark Chocolate"]
Out[12]:
0       NaN
1       JOY
2       NaN
3       MEH
4       JOY
       ... 
2474    JOY
2475    MEH
2476    JOY
2477    NaN
2478    JOY
Name: Q6 | Hershey's Dark Chocolate, Length: 2479, dtype: object

Q1: How many duplicated rows are there in the file? Assume that a duplicate is any row that is exactly the same as another one. Save this number as Q1.

In [13]:
#Number of duplicated rows
Q1 = candy.duplicated().sum()      
Q1
Out[13]:
17

Q2: How many duplicated rows are there in the file if we were to assume that a duplicate is any row with the same Internal ID number as another. In other words, even if the other values are different, a row would count as a duplicate if it had the same Internal ID as another. Save this number as Q2.

In [14]:
#Number of duplicated rows with same Internal ID number
Q2 = candy[candy.duplicated('Internal ID')]          
Q2 = len(Q2)
Q2
Out[14]:
19

Exercise_B: Drop any duplicates from the candy DataFrame. Duplicates are to be defined as any row with the same Internal ID as another. Use the default setting that keeps the first record from the duplicates.

In [15]:
##Drop any duplicates from the candy DataFrame

#subset is used to remove duplicates from specific column
#To remove duplicates from entire dataset, we use: df.drop_duplicates()
# keep option is set to ‘first’ to remove duplicates and keep the first occurrences only
# keep option is set to False to remove all the occurrences of duplicate column(s)

candy = candy.drop_duplicates(subset = 'Internal ID', keep='first') 
candy
Out[15]:
Internal ID Q1: GOING OUT? Q2: GENDER Q3: AGE Q4: COUNTRY Q5: STATE, PROVINCE, COUNTY, ETC Q6 | 100 Grand Bar Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) Q6 | Any full-sized candy bar Q6 | Black Jacks ... Q8: DESPAIR OTHER Q9: OTHER COMMENTS Q10: DRESS Unnamed: 113 Q11: DAY Q12: MEDIA [Daily Dish] Q12: MEDIA [Science] Q12: MEDIA [ESPN] Q12: MEDIA [Yahoo] Click Coordinates (x, y)
0 90258773 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 90272821 No Male 44 USA NM MEH DESPAIR JOY MEH ... NaN Bottom line is Twix is really the only candy w... White and gold NaN Sunday NaN 1.0 NaN NaN (84, 25)
2 90272829 NaN Male 49 USA Virginia NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 90272840 No Male 40 us or MEH DESPAIR JOY MEH ... NaN Raisins can go to hell White and gold NaN Sunday NaN 1.0 NaN NaN (75, 23)
4 90272841 No Male 23 usa exton pa JOY DESPAIR JOY DESPAIR ... NaN NaN White and gold NaN Friday NaN 1.0 NaN NaN (70, 10)
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2474 90314359 No Male 24 USA MD JOY DESPAIR MEH DESPAIR ... Fruit Stripe Gum NaN White and gold NaN Friday NaN NaN NaN NaN NaN
2475 90314580 No Female 33 USA New York MEH DESPAIR JOY NaN ... Capers NaN Blue and black NaN Friday NaN 1.0 NaN NaN (70, 26)
2476 90314634 No Female 26 USA Tennessee MEH DESPAIR JOY DESPAIR ... NaN NaN Blue and black NaN Friday NaN 1.0 NaN NaN (67, 35)
2477 90314658 No Male 58 Usa North Carolina NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2478 90314802 No Female 66 usa Pennsylvania DESPAIR DESPAIR JOY DESPAIR ... NaN You hit all my chocolate highlights, and broug... White and gold NaN Sunday 1.0 NaN NaN NaN (19, 26)

2460 rows × 120 columns

Exercise_C: Your next task is to remove the following columns from the candy DataFrame as we will not use these columns for this project. You are welcome to do further analysis on these columns but do not save your analysis in this notebook.

Remove the following columns: Internal ID, Q5: STATE, PROVINCE, COUNTY, ETC, Q7: JOY OTHER, Q8: DESPAIR OTHER, Q9: OTHER COMMENTS, Unnamed: 113, Click Coordinates (x, y).

In [16]:
#Remove the following multiple columns from the candy DataFrame:

candy = candy.drop(
    columns=["Internal ID","Q5: STATE, PROVINCE, COUNTY, ETC",
             "Q7: JOY OTHER", "Q8: DESPAIR OTHER", "Q9: OTHER COMMENTS", 
             "Unnamed: 113", "Click Coordinates (x, y)"], axis = 1)   

candy
Out[16]:
Q1: GOING OUT? Q2: GENDER Q3: AGE Q4: COUNTRY Q6 | 100 Grand Bar Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) Q6 | Any full-sized candy bar Q6 | Black Jacks Q6 | Bonkers (the candy) Q6 | Bonkers (the board game) ... Q6 | Whatchamacallit Bars Q6 | White Bread Q6 | Whole Wheat anything Q6 | York Peppermint Patties Q10: DRESS Q11: DAY Q12: MEDIA [Daily Dish] Q12: MEDIA [Science] Q12: MEDIA [ESPN] Q12: MEDIA [Yahoo]
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 No Male 44 USA MEH DESPAIR JOY MEH DESPAIR DESPAIR ... DESPAIR DESPAIR DESPAIR DESPAIR White and gold Sunday NaN 1.0 NaN NaN
2 NaN Male 49 USA NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 No Male 40 us MEH DESPAIR JOY MEH MEH DESPAIR ... JOY DESPAIR DESPAIR DESPAIR White and gold Sunday NaN 1.0 NaN NaN
4 No Male 23 usa JOY DESPAIR JOY DESPAIR MEH DESPAIR ... JOY DESPAIR DESPAIR JOY White and gold Friday NaN 1.0 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2474 No Male 24 USA JOY DESPAIR MEH DESPAIR DESPAIR MEH ... DESPAIR MEH DESPAIR MEH White and gold Friday NaN NaN NaN NaN
2475 No Female 33 USA MEH DESPAIR JOY NaN NaN NaN ... JOY DESPAIR MEH JOY Blue and black Friday NaN 1.0 NaN NaN
2476 No Female 26 USA MEH DESPAIR JOY DESPAIR MEH JOY ... MEH DESPAIR DESPAIR MEH Blue and black Friday NaN 1.0 NaN NaN
2477 No Male 58 Usa NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2478 No Female 66 usa DESPAIR DESPAIR JOY DESPAIR DESPAIR DESPAIR ... DESPAIR MEH DESPAIR JOY White and gold Sunday 1.0 NaN NaN NaN

2460 rows × 113 columns

Code Check: As a check for the above exercises, the shape of your data should now be: (2460, 113)

In [17]:
#Code check
candy.shape                          
Out[17]:
(2460, 113)

Exercise_D: Let's now take a look at the Q2: GENDER column since this will be what we are trying to predict. Take a look at the value counts for this column.

In [18]:
# view value counts for Q2: GENDER
candy['Q2: GENDER'].value_counts()         
Out[18]:
Male                  1466
Female                 839
I'd rather not say      83
Other                   30
Name: Q2: GENDER, dtype: int64

Q3: How many missing values are in the Q2: GENDER column? Save this as Q3.

In [19]:
# We use this to create True/False mask
candy['Q2: GENDER'].isnull()
Out[19]:
0        True
1       False
2       False
3       False
4       False
        ...  
2474    False
2475    False
2476    False
2477    False
2478    False
Name: Q2: GENDER, Length: 2460, dtype: bool
In [20]:
# Number of missing values from data
Q3 = candy['Q2: GENDER'].isnull().sum()        
Q3 
Out[20]:
42

Exercise_E: Using the candy DataFrame, remove all rows with a missing value in the Q2: GENDER column. (This should overwrite and be saved as candy like you have been doing for the previous exercises.)

In [21]:
# Remove all rows with a missing value in the Q2: GENDER column

candy = candy.dropna(subset=['Q2: GENDER'])
In [22]:
#Candy shape                  #from 2460 to 2418 = 2460-42
candy.shape             
Out[22]:
(2418, 113)
In [23]:
#Code check
candy['Q2: GENDER'].isnull().sum()
Out[23]:
0

Exercise_F: For this project, we want to use binary classification, which predicts one of two classes. We want to predict between Male or Female. Because of this, select only the rows that contain either Male or Female in the Q2: GENDER column.

In [24]:
# Select only the rows that contain either Male or Female in the Q2: GENDER column
candy = candy[candy['Q2: GENDER'].isin(['Male', 'Female'])]
candy
Out[24]:
Q1: GOING OUT? Q2: GENDER Q3: AGE Q4: COUNTRY Q6 | 100 Grand Bar Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) Q6 | Any full-sized candy bar Q6 | Black Jacks Q6 | Bonkers (the candy) Q6 | Bonkers (the board game) ... Q6 | Whatchamacallit Bars Q6 | White Bread Q6 | Whole Wheat anything Q6 | York Peppermint Patties Q10: DRESS Q11: DAY Q12: MEDIA [Daily Dish] Q12: MEDIA [Science] Q12: MEDIA [ESPN] Q12: MEDIA [Yahoo]
1 No Male 44 USA MEH DESPAIR JOY MEH DESPAIR DESPAIR ... DESPAIR DESPAIR DESPAIR DESPAIR White and gold Sunday NaN 1.0 NaN NaN
2 NaN Male 49 USA NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 No Male 40 us MEH DESPAIR JOY MEH MEH DESPAIR ... JOY DESPAIR DESPAIR DESPAIR White and gold Sunday NaN 1.0 NaN NaN
4 No Male 23 usa JOY DESPAIR JOY DESPAIR MEH DESPAIR ... JOY DESPAIR DESPAIR JOY White and gold Friday NaN 1.0 NaN NaN
5 No Male NaN NaN JOY DESPAIR JOY NaN NaN NaN ... JOY DESPAIR DESPAIR JOY NaN NaN NaN 1.0 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2474 No Male 24 USA JOY DESPAIR MEH DESPAIR DESPAIR MEH ... DESPAIR MEH DESPAIR MEH White and gold Friday NaN NaN NaN NaN
2475 No Female 33 USA MEH DESPAIR JOY NaN NaN NaN ... JOY DESPAIR MEH JOY Blue and black Friday NaN 1.0 NaN NaN
2476 No Female 26 USA MEH DESPAIR JOY DESPAIR MEH JOY ... MEH DESPAIR DESPAIR MEH Blue and black Friday NaN 1.0 NaN NaN
2477 No Male 58 Usa NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2478 No Female 66 usa DESPAIR DESPAIR JOY DESPAIR DESPAIR DESPAIR ... DESPAIR MEH DESPAIR JOY White and gold Sunday 1.0 NaN NaN NaN

2305 rows × 113 columns

Code Check: As a check for the above exercises, the shape of your data should now be: (2305, 113)

In [25]:
#Code check
candy.shape
Out[25]:
(2305, 113)

Now, let's work on filling some of the missing data. There are easier ways to do this with the sklearn library which you will learn about more in the machine learning classes, but for now, let's try to practice our Pandas skills.

Q4: How many missing values are in the Q1: GOING OUT? column? Save this number as Q4.

In [26]:
Q4 = candy['Q1: GOING OUT?'].isnull().sum()       #77 
Q4     
Out[26]:
77

Exercise_G: For a future analysis question, we are interested in those that we know will definitely go out for Halloween. Because of this, fill all missing values in the Q1: GOING OUT? column with a No value.

In [27]:
#Fill all missing values in the Q1: GOING OUT? column with a 'No' value.
candy['Q1: GOING OUT?'].fillna('No', inplace=True)    

Code Check: Double check your above work and look at the value counts for the Q1: GOING OUT? column. Make sure that you only have "Yes" and No" values and that they add up to 2305, which is the number of rows you should have at this step in the assignment.

In [28]:
#Code check  #total No + Yes = 2305

candy['Q1: GOING OUT?'].value_counts()
Out[28]:
No     2007
Yes     298
Name: Q1: GOING OUT?, dtype: int64
In [29]:
#code check for missing values
candy['Q1: GOING OUT?'].isnull().sum()
Out[29]:
0

Q5: To get ready for the next step, let's practice selecting all the columns: going from Q6 | 100 Grand Bar to Q11: DAY. Save this slice as Q5.

In [30]:
#Select all columns going from: Q6 | 100 Grand Bar to Q11: DAY
# Save this slice as Q5
Q5 = candy.loc[:,'Q6 | 100 Grand Bar':'Q11: DAY']
Q5
Out[30]:
Q6 | 100 Grand Bar Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) Q6 | Any full-sized candy bar Q6 | Black Jacks Q6 | Bonkers (the candy) Q6 | Bonkers (the board game) Q6 | Bottle Caps Q6 | Box'o'Raisins Q6 | Broken glow stick Q6 | Butterfinger ... Q6 | Trail Mix Q6 | Twix Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein Q6 | Vicodin Q6 | Whatchamacallit Bars Q6 | White Bread Q6 | Whole Wheat anything Q6 | York Peppermint Patties Q10: DRESS Q11: DAY
1 MEH DESPAIR JOY MEH DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR ... DESPAIR JOY DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR White and gold Sunday
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 MEH DESPAIR JOY MEH MEH DESPAIR MEH DESPAIR DESPAIR MEH ... MEH JOY DESPAIR JOY JOY DESPAIR DESPAIR DESPAIR White and gold Sunday
4 JOY DESPAIR JOY DESPAIR MEH DESPAIR MEH DESPAIR DESPAIR MEH ... DESPAIR JOY MEH JOY JOY DESPAIR DESPAIR JOY White and gold Friday
5 JOY DESPAIR JOY NaN NaN NaN MEH MEH DESPAIR JOY ... MEH JOY DESPAIR DESPAIR JOY DESPAIR DESPAIR JOY NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2474 JOY DESPAIR MEH DESPAIR DESPAIR MEH MEH DESPAIR DESPAIR MEH ... JOY JOY MEH JOY DESPAIR MEH DESPAIR MEH White and gold Friday
2475 MEH DESPAIR JOY NaN NaN NaN NaN DESPAIR DESPAIR JOY ... DESPAIR JOY NaN NaN JOY DESPAIR MEH JOY Blue and black Friday
2476 MEH DESPAIR JOY DESPAIR MEH JOY DESPAIR MEH MEH DESPAIR ... MEH MEH MEH JOY MEH DESPAIR DESPAIR MEH Blue and black Friday
2477 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2478 DESPAIR DESPAIR JOY DESPAIR DESPAIR DESPAIR DESPAIR MEH DESPAIR JOY ... DESPAIR DESPAIR DESPAIR JOY DESPAIR MEH DESPAIR JOY White and gold Sunday

2305 rows × 105 columns

Exercise_H: Now that you know how to slice the data, fill any missing values in the candy DataFrame for those columns (going from Q6 | 100 Grand Bar to Q11: DAY) with the string NO_ANSWER.

In [31]:
#Fill any missing values in the candy DataFrame for those columns 
# (going from Q6 | 100 Grand Bar to Q11: DAY) with the string NO_ANSWER
#I call this new df as Q51, to latter help us join with anther new df

Q51 = candy.loc[:,'Q6 | 100 Grand Bar':'Q11: DAY'].fillna('NO_ANSWER')
Q51
Out[31]:
Q6 | 100 Grand Bar Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) Q6 | Any full-sized candy bar Q6 | Black Jacks Q6 | Bonkers (the candy) Q6 | Bonkers (the board game) Q6 | Bottle Caps Q6 | Box'o'Raisins Q6 | Broken glow stick Q6 | Butterfinger ... Q6 | Trail Mix Q6 | Twix Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein Q6 | Vicodin Q6 | Whatchamacallit Bars Q6 | White Bread Q6 | Whole Wheat anything Q6 | York Peppermint Patties Q10: DRESS Q11: DAY
1 MEH DESPAIR JOY MEH DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR ... DESPAIR JOY DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR White and gold Sunday
2 NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER ... NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER
3 MEH DESPAIR JOY MEH MEH DESPAIR MEH DESPAIR DESPAIR MEH ... MEH JOY DESPAIR JOY JOY DESPAIR DESPAIR DESPAIR White and gold Sunday
4 JOY DESPAIR JOY DESPAIR MEH DESPAIR MEH DESPAIR DESPAIR MEH ... DESPAIR JOY MEH JOY JOY DESPAIR DESPAIR JOY White and gold Friday
5 JOY DESPAIR JOY NO_ANSWER NO_ANSWER NO_ANSWER MEH MEH DESPAIR JOY ... MEH JOY DESPAIR DESPAIR JOY DESPAIR DESPAIR JOY NO_ANSWER NO_ANSWER
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2474 JOY DESPAIR MEH DESPAIR DESPAIR MEH MEH DESPAIR DESPAIR MEH ... JOY JOY MEH JOY DESPAIR MEH DESPAIR MEH White and gold Friday
2475 MEH DESPAIR JOY NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER DESPAIR DESPAIR JOY ... DESPAIR JOY NO_ANSWER NO_ANSWER JOY DESPAIR MEH JOY Blue and black Friday
2476 MEH DESPAIR JOY DESPAIR MEH JOY DESPAIR MEH MEH DESPAIR ... MEH MEH MEH JOY MEH DESPAIR DESPAIR MEH Blue and black Friday
2477 NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER ... NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER
2478 DESPAIR DESPAIR JOY DESPAIR DESPAIR DESPAIR DESPAIR MEH DESPAIR JOY ... DESPAIR DESPAIR DESPAIR JOY DESPAIR MEH DESPAIR JOY White and gold Sunday

2305 rows × 105 columns

Exercise_I: For all four Q12: Media columns in the candy DataFrame, fill the missing values with 0.0.

In [32]:
#For all four Q12: Media columns in the candy DataFrame
#Fill the missing values with 0.0
#Let's name this new df as Q52

candy.update(candy[['Q12: MEDIA [Daily Dish]', 
                    'Q12: MEDIA [Science]', 
                    'Q12: MEDIA [ESPN]', 
                    'Q12: MEDIA [Yahoo]']].fillna(0))
In [33]:
#1/ Code check for Q12 columns 

     #After First checking, I name this new df as Q52

Q52 = candy[['Q12: MEDIA [Daily Dish]',
             'Q12: MEDIA [Science]',
             'Q12: MEDIA [ESPN]',
             'Q12: MEDIA [Yahoo]']]
Q52
Out[33]:
Q12: MEDIA [Daily Dish] Q12: MEDIA [Science] Q12: MEDIA [ESPN] Q12: MEDIA [Yahoo]
1 0.0 1.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 1.0 0.0 0.0
4 0.0 1.0 0.0 0.0
5 0.0 1.0 0.0 0.0
... ... ... ... ...
2474 0.0 0.0 0.0 0.0
2475 0.0 1.0 0.0 0.0
2476 0.0 1.0 0.0 0.0
2477 0.0 0.0 0.0 0.0
2478 1.0 0.0 0.0 0.0

2305 rows × 4 columns

In [34]:
#Now we combine Q51 and Q52 together,and check all missed values filled/replaced
#We use the concat method to join both DFs side by side(hstack)

# We save this new df in candy1 t

candy1 = pd.concat([Q51, Q52], axis=1)
candy1
Out[34]:
Q6 | 100 Grand Bar Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) Q6 | Any full-sized candy bar Q6 | Black Jacks Q6 | Bonkers (the candy) Q6 | Bonkers (the board game) Q6 | Bottle Caps Q6 | Box'o'Raisins Q6 | Broken glow stick Q6 | Butterfinger ... Q6 | Whatchamacallit Bars Q6 | White Bread Q6 | Whole Wheat anything Q6 | York Peppermint Patties Q10: DRESS Q11: DAY Q12: MEDIA [Daily Dish] Q12: MEDIA [Science] Q12: MEDIA [ESPN] Q12: MEDIA [Yahoo]
1 MEH DESPAIR JOY MEH DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR ... DESPAIR DESPAIR DESPAIR DESPAIR White and gold Sunday 0.0 1.0 0.0 0.0
2 NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER ... NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER 0.0 0.0 0.0 0.0
3 MEH DESPAIR JOY MEH MEH DESPAIR MEH DESPAIR DESPAIR MEH ... JOY DESPAIR DESPAIR DESPAIR White and gold Sunday 0.0 1.0 0.0 0.0
4 JOY DESPAIR JOY DESPAIR MEH DESPAIR MEH DESPAIR DESPAIR MEH ... JOY DESPAIR DESPAIR JOY White and gold Friday 0.0 1.0 0.0 0.0
5 JOY DESPAIR JOY NO_ANSWER NO_ANSWER NO_ANSWER MEH MEH DESPAIR JOY ... JOY DESPAIR DESPAIR JOY NO_ANSWER NO_ANSWER 0.0 1.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2474 JOY DESPAIR MEH DESPAIR DESPAIR MEH MEH DESPAIR DESPAIR MEH ... DESPAIR MEH DESPAIR MEH White and gold Friday 0.0 0.0 0.0 0.0
2475 MEH DESPAIR JOY NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER DESPAIR DESPAIR JOY ... JOY DESPAIR MEH JOY Blue and black Friday 0.0 1.0 0.0 0.0
2476 MEH DESPAIR JOY DESPAIR MEH JOY DESPAIR MEH MEH DESPAIR ... MEH DESPAIR DESPAIR MEH Blue and black Friday 0.0 1.0 0.0 0.0
2477 NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER ... NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER 0.0 0.0 0.0 0.0
2478 DESPAIR DESPAIR JOY DESPAIR DESPAIR DESPAIR DESPAIR MEH DESPAIR JOY ... DESPAIR MEH DESPAIR JOY White and gold Sunday 1.0 0.0 0.0 0.0

2305 rows × 109 columns

Code Check: As a check for the above code, make sure that there are no missing values left for the Q6 to Q12 columns.

In [35]:
# Code check for missing values: 'Q6 | 100 Grand Bar':'Q12: Media'

candy1.isnull().sum()
Out[35]:
Q6 | 100 Grand Bar                                                                        0
Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes)    0
Q6 | Any full-sized candy bar                                                             0
Q6 | Black Jacks                                                                          0
Q6 | Bonkers (the candy)                                                                  0
                                                                                         ..
Q11: DAY                                                                                  0
Q12: MEDIA [Daily Dish]                                                                   0
Q12: MEDIA [Science]                                                                      0
Q12: MEDIA [ESPN]                                                                         0
Q12: MEDIA [Yahoo]                                                                        0
Length: 109, dtype: int64

Before proceeding to next, we assign a new DataFrame: candy2 for remaining columns and then next we combine candy1 and cand 2 DataFrames to get new modified candy dataset¶

In [36]:
#Let's assign the rest column a new df: cand2
candy2 = candy.loc[:,'Q1: GOING OUT?':'Q4: COUNTRY']
candy2
Out[36]:
Q1: GOING OUT? Q2: GENDER Q3: AGE Q4: COUNTRY
1 No Male 44 USA
2 No Male 49 USA
3 No Male 40 us
4 No Male 23 usa
5 No Male NaN NaN
... ... ... ... ...
2474 No Male 24 USA
2475 No Female 33 USA
2476 No Female 26 USA
2477 No Male 58 Usa
2478 No Female 66 usa

2305 rows × 4 columns

In [37]:
# Now let's combine candy1, and candy2 to get modified new candy dataset
# We save it in candy DataFrame

candy = pd.concat([candy2, candy1], axis=1)
candy 
Out[37]:
Q1: GOING OUT? Q2: GENDER Q3: AGE Q4: COUNTRY Q6 | 100 Grand Bar Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) Q6 | Any full-sized candy bar Q6 | Black Jacks Q6 | Bonkers (the candy) Q6 | Bonkers (the board game) ... Q6 | Whatchamacallit Bars Q6 | White Bread Q6 | Whole Wheat anything Q6 | York Peppermint Patties Q10: DRESS Q11: DAY Q12: MEDIA [Daily Dish] Q12: MEDIA [Science] Q12: MEDIA [ESPN] Q12: MEDIA [Yahoo]
1 No Male 44 USA MEH DESPAIR JOY MEH DESPAIR DESPAIR ... DESPAIR DESPAIR DESPAIR DESPAIR White and gold Sunday 0.0 1.0 0.0 0.0
2 No Male 49 USA NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER ... NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER 0.0 0.0 0.0 0.0
3 No Male 40 us MEH DESPAIR JOY MEH MEH DESPAIR ... JOY DESPAIR DESPAIR DESPAIR White and gold Sunday 0.0 1.0 0.0 0.0
4 No Male 23 usa JOY DESPAIR JOY DESPAIR MEH DESPAIR ... JOY DESPAIR DESPAIR JOY White and gold Friday 0.0 1.0 0.0 0.0
5 No Male NaN NaN JOY DESPAIR JOY NO_ANSWER NO_ANSWER NO_ANSWER ... JOY DESPAIR DESPAIR JOY NO_ANSWER NO_ANSWER 0.0 1.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2474 No Male 24 USA JOY DESPAIR MEH DESPAIR DESPAIR MEH ... DESPAIR MEH DESPAIR MEH White and gold Friday 0.0 0.0 0.0 0.0
2475 No Female 33 USA MEH DESPAIR JOY NO_ANSWER NO_ANSWER NO_ANSWER ... JOY DESPAIR MEH JOY Blue and black Friday 0.0 1.0 0.0 0.0
2476 No Female 26 USA MEH DESPAIR JOY DESPAIR MEH JOY ... MEH DESPAIR DESPAIR MEH Blue and black Friday 0.0 1.0 0.0 0.0
2477 No Male 58 Usa NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER ... NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER 0.0 0.0 0.0 0.0
2478 No Female 66 usa DESPAIR DESPAIR JOY DESPAIR DESPAIR DESPAIR ... DESPAIR MEH DESPAIR JOY White and gold Sunday 1.0 0.0 0.0 0.0

2305 rows × 113 columns

Now, let's look at the very messy Q4: COUNTRY column and see what we can do about it. First, run the code below to look at the different unique values in the data.¶

In [38]:
# check unique values, in Q4: COUNTRY column

candy['Q4: COUNTRY'].unique()
Out[38]:
array(['USA ', 'USA', 'us', 'usa', nan, 'canada', 'Canada', 'Us', 'US',
       'Murica', 'United States', 'uk', 'United Kingdom', 'united states',
       'Usa', 'United States ', 'United staes',
       'United States of America', 'UAE', 'England', 'UK', 'canada ',
       'United states', 'u.s.a.', '35', 'france',
       'United States of America ', 'america', 'U.S.A.', 'finland',
       'unhinged states', 'Mexico', 'Canada ', 'united states of america',
       'US of A', 'The United States', 'North Carolina ', 'Unied States',
       'Netherlands', 'germany', 'Europe', 'U S', 'u.s.', 'U.K. ',
       'Costa Rica', 'The United States of America', 'unite states',
       'U.S.', '46', 'Australia', 'Greece', 'USA? Hard to tell anymore..',
       "'merica", '45', 'United State', '32', 'France', 'australia',
       'Can', 'Canae', 'Trumpistan', 'Ireland', 'United Sates', 'Korea',
       'California', 'Unites States', 'Japan', 'USa', 'South africa',
       'I pretend to be from Canada, but I am really from the United States.',
       'Usa ', 'Uk', 'Germany', 'Canada`', 'Scotland', 'UK ', 'Denmark',
       'United Stated', 'France ', 'Switzerland', 'UD', 'Scotland ',
       'South Korea', 'New Jersey', 'CANADA', 'Indonesia',
       'United ststes', 'America', 'The Netherlands', 'United Statss',
       'endland', 'murrika', 'USA! USA! USA!', 'USAA', 'united States ',
       'soviet canuckistan', 'N. America', 'Singapore', 'USSA', 'China',
       'Taiwan', 'Ireland ', 'hong kong', 'spain', 'Sweden', 'Hong Kong',
       'U.S. ', 'Narnia', 'u s a', 'United Statea', 'united ststes',
       'subscribe to dm4uz3 on youtube', 'United kingdom',
       'USA USA USA!!!!', "I don't know anymore", 'Fear and Loathing'],
      dtype=object)

Code Check: As a check for the Country column, check to see how many unique values are in the data. You should have 115 different unique values for the Q4: COUNTRY column. If you have less or more than this number, double check your work above.

In [39]:
# check the Q4: COUNTRY number of unique values

candy['Q4: COUNTRY'].nunique()
Out[39]:
115

We want to clean up this data to only include four areas: USA, Canada, Europe (the continent, not necessarily the European Union), and Other.

There are different ways to do this, but I would suggest that you look at the way we handled the property_type column in the vienna data set and the code in the amenities_to_columns() function in the module notebook. These might be a little harder than those examples but they should give you a good baseline approach.

You could use replace() for this step, and it is fine if you ultimately decide to do this, but I would suggest that you come up with a solution similar to what was shown in the vienna data cleaning notebook. This method would be much more robust if you had many more values in your data.

I suggest the following order for this section to make it easier:

  • Fill in all missing values with Other
  • Code Australia as Other (doing this step will help when trying to use us in the next step if you use string methods)
  • Combine all USA entries together as USA
  • Combine Canadian entries as CA
  • Combine European entries as EU
  • Everything else gets coded as Other

Exercise_J: Fill the missing values in the Q4: Country column with Other.

In [40]:
#View the Q4: Country column
candy['Q4: COUNTRY']
Out[40]:
1       USA 
2        USA
3         us
4        usa
5        NaN
        ... 
2474     USA
2475     USA
2476    USA 
2477     Usa
2478     usa
Name: Q4: COUNTRY, Length: 2305, dtype: object
In [41]:
#Check for value counts
candy['Q4: COUNTRY'].value_counts()
Out[41]:
USA                            669
United States                  475
usa                            210
Canada                         168
Usa                            131
                              ... 
United State                     1
45                               1
'merica                          1
USA? Hard to tell anymore..      1
Fear and Loathing                1
Name: Q4: COUNTRY, Length: 115, dtype: int64
In [42]:
#Fill the missing values in the Q4: Country column with 'Other'

candy['Q4: COUNTRY'] = candy['Q4: COUNTRY'].fillna('Other')
candy['Q4: COUNTRY']
Out[42]:
1        USA 
2         USA
3          us
4         usa
5       Other
        ...  
2474      USA
2475      USA
2476     USA 
2477      Usa
2478      usa
Name: Q4: COUNTRY, Length: 2305, dtype: object

Code Check: Double check that there are no missing values in the Q4: COUNTRY column. Also, double check the unique values to make sure that "Other" was added. This should mean that you now have 116 unique values for this column.

In [43]:
# check missing Q4 values
candy['Q4: COUNTRY'].isnull().sum()     
Out[43]:
0
In [44]:
# check unique values 

candy['Q4: COUNTRY'].nunique()   
Out[44]:
116

Exercise_K: Combine all Australia entries into Other. Watch out for capitalization issues. You should have 114 unique values after this step.

In [45]:
#Combine all Australia entries into Other
candy['Q4: COUNTRY'].replace({'Australia':'Other',
                              'australia':'Other'
                             }, inplace=True)
In [46]:
# check unique values
candy['Q4: COUNTRY'].nunique() 
Out[46]:
114
In [47]:
candy['Q4: COUNTRY'] == 'O'
Out[47]:
1       False
2       False
3       False
4       False
5       False
        ...  
2474    False
2475    False
2476    False
2477    False
2478    False
Name: Q4: COUNTRY, Length: 2305, dtype: bool

Exercise_L: Combine all United States entries together into USA. These would include the following:

'USA ', 'USA', 'us', 'usa', 'Us', 'US', 'Murica', 'United States', 'united states', 'Usa', 'United States ', 'United staes', 'United States of America', 'United states', 'u.s.a.', 'United States of America ', 'america', 'U.S.A.', 'unhinged states', 'united states of america', 'US of A', 'The United States', 'North Carolina ', 'Unied States', 'U S', 'u.s.', 'The United States of America', 'unite states','U.S.', 'USA? Hard to tell anymore..', "'merica", 'United State', 'United Sates', 'California', 'Unites States', 'USa', 'I pretend to be from Canada, but I am really from the United States.', 'Usa ', 'United Stated', 'New Jersey', 'United ststes', 'America', 'United Statss', 'murrika', 'USA! USA! USA!', 'USAA', 'united States ', 'N. America', 'USSA', 'U.S. ', 'u s a', 'United Statea', 'united ststes', 'USA USA USA!!!!'
In [48]:
#Combine all United States entries together into USA
candy['Q4: COUNTRY'].replace({ 
    'USA ':'USA', 'USA':'USA', 'us':'USA', 'usa':'USA', 'Us':'USA', 'US':'USA', 
    'Murica':'USA', 'United States':'USA', 'united states':'USA', 
    'Usa':'USA', 'United States ':'USA', 'United staes':'USA', 
    'United States of America':'USA', 'United states':'USA', 
    'u.s.a.':'USA', 'United States of America ':'USA', 
    'america':'USA', 'U.S.A.':'USA', 'unhinged states':'USA', 
    'united states of america':'USA', 'US of A':'USA', 
    'The United States':'USA', 'North Carolina ':'USA', 
    'Unied States':'USA', 'U S':'USA', 'u.s.':'USA', 
    'The United States of America':'USA', 
    'unite states':'USA','U.S.':'USA', 'USA? Hard to tell anymore..':'USA', 
    "'merica":'USA', 'United State':'USA', 'United Sates':'USA', 
    'California':'USA', 'Unites States':'USA', 'USa':'USA', 
    'I pretend to be from Canada, but I am really from the United States.':'USA', 
    'Usa ':'USA', 'United Stated':'USA', 'New Jersey':'USA', 'United ststes':'USA', 
    'America':'USA', 'United Statss':'USA', 'murrika':'USA', 
    'USA! USA! USA!':'USA', 'USAA':'USA', 'united States ':'USA', 
    'N. America':'USA', 'USSA':'USA', 'U.S. ':'USA', 'u s a':'USA', 
    'United Statea':'USA', 'united ststes':'USA', 
    'USA USA USA!!!!':'USA'
}, inplace=True)

Code Check: You should be merging the above values together into 1 (USA) and be left with 61 unique values after this step (including the USA value).

In [49]:
# check unique values
candy['Q4: COUNTRY'].nunique()   
Out[49]:
61

Exercise_M: Combine the Canadian entries (both upper and lower case) and label them as CA. Be careful as there are extra spaces, characters, and misspellings (Can, Canae).

These values include:

'canada', 'Canada', 'canada ', 'Canada ', 'Can', 'Canae', 'Canada`', 'CANADA'
In [50]:
#Combine the Canadian entries (both upper and lower case) 
# And label them as CA
candy['Q4: COUNTRY'].replace({
    'canada':'CA', 'Canada':'CA', 'canada ':'CA', 
    'Canada ':'CA', 'Can':'CA', 'Canae':'CA', 
    'Canada`':'CA', 'CANADA':'CA'}, inplace=True)

Code Check: You should be merging 8 values together into 1 (CA) and be left with 54 unique values after this step (including the CA value).

In [51]:
# check unique values
candy['Q4: COUNTRY'].nunique()     
Out[51]:
54

Exercise_N: Combine the European entries and label them as EU. Again, we are looking at the continent of Europe and not necessarily the countries that are a part of the European Union.

These values include:

'uk', 'United Kingdom', 'England', 'UK', 'france', 'finland', 'Netherlands', 'germany', 'Europe', 'U.K. ', 'Greece', 'France', 'Ireland', 'Uk', 'Germany', 'Scotland', 'UK ', 'Denmark', 'France ', 'Switzerland', 'Scotland ', 'The Netherlands', 'Ireland ', 'spain', 'Sweden', 'United kingdom'
In [52]:
candy['Q4: COUNTRY'].replace({
    'uk':'EU','United Kingdom':'EU', 
    'England':'EU', 'UK':'EU', 'france':'EU', 
    'finland':'EU', 'Netherlands':'EU', 
    'germany':'EU', 'Europe':'EU', 'U.K. ':'EU', 
    'Greece':'EU', 'France':'EU', 'Ireland':'EU', 
    'Uk':'EU', 'Germany':'EU', 'Scotland':'EU', 
    'UK ':'EU', 'Denmark':'EU', 'France ':'EU', 
    'Switzerland':'EU', 'Scotland ':'EU', 
    'The Netherlands':'EU', 'Ireland ':'EU', 
    'spain':'EU', 'Sweden':'EU', 
    'United kingdom':'EU'}, inplace=True)

Code Check: You should be merging 26 entries together and be left with 29 unique values after this step (including the EU value).

In [53]:
# check unique values
candy['Q4: COUNTRY'].nunique()     
Out[53]:
29

Exercise_O: Finally, combine the other entries and label them as Other.

In [54]:
#Combine the other entries and label them as 'Other'

candy['Q4: COUNTRY'].replace({
    'Murica':'Other','nan':'Other','UAE':'Other','35':'Other',
    'unhinged states':'Other','Mexico':'Other','Costa Rica':'Other','46':'Other',
    'Other':'Other','45':'Other','32':'Other','Trumpistan':'Other','Korea':'Other',
    'Japan':'Other','South africa':'Other','UD':'Other', 'South Korea':'Other',
    'Indonesia':'Other','endland':'Other', 'murrika':'Other', 'soviet canuckistan':'Other', 
    'Singapore':'Other', 'China':'Other','Taiwan':'Other', 'hong kong':'Other', 'spain':'Other', 
    'Sweden':'Other','Hong Kong':'Other','Narnia':'Other', 'subscribe to dm4uz3 on youtube':'Other', 
    "I don't know anymore":'Other','Fear and Loathing':'Other'
}, inplace=True)   

Code Check: Double check that you only have four unique values in the Q4: COUNTRY column: USA, Other, CA, and EU

In [55]:
# check unique values
candy['Q4: COUNTRY'].nunique()                
Out[55]:
4

Nite: we can also perform this with replace() as we did above, but we can also think of a way to do it in one line of code, than writing many in dictionary. Hint: You want to select everything where the Q4: COUNTRY is not USA, Canada, Europe and code those as 'Other'.

In [56]:
## Alternative option:

# All other values will be grouped as 'other'
# So we can use the following code
# ~ means, negating(take the opposite)

# candy['Q4: COUNTRY'] != 'USA', 'CA', 'EU' 

candy.loc[~candy['Q4: COUNTRY'].isin(['USA', 'CA','EU' ]), 'Q4: COUNTRY'] = 'Other'

Q6: To double check that everything was coded correctly, save the value counts of the Q4: COUNTRY column as Q6. You can check this once you run your CodeGrade check.

In [57]:
# Save the value counts of the Q4: COUNTRY column as Q6

Q6 = candy['Q4: COUNTRY'].value_counts()    
Q6
Out[57]:
USA      1955
CA        216
EU         73
Other      61
Name: Q4: COUNTRY, dtype: int64
In [58]:
candy['Q4: COUNTRY'].isnull().sum()  
Out[58]:
0

We now want to look at the Q3: AGE column. Let's look at all the unique values.¶

In [59]:
# check unique age values for the age column
candy['Q3: AGE'].nunique() 
Out[59]:
95
In [60]:
candy['Q3: AGE'].unique() 
Out[60]:
array(['44', '49', '40', '23', nan, '53', '33', '43', '56', '64', '37',
       '48', '54', '36', '45', '25', '34', '35', '38', '58', '50', '47',
       '16', '52', '63', '65', '41', '27', '31', '59', '61', '46', '42',
       '62', '29', '39', '32', '28', '69', '67', '30', '22', '51', '70',
       '24', '19', 'Old enough', '57', '60', '66', '12', 'Many', '55',
       '72', '?', '21', '11', 'no', '9', '68', '20', '6', '10', '71',
       '13', '26', '45-55', '7', '39.4', '74', '18', 'older than dirt',
       '17', '15', '8', '75', '5u', 'Enough', 'Over 50', '90', '76',
       'sixty-nine', 'ancient', '77', 'OLD', 'old', '73', '70 1/2', '14',
       'MY NAME JEFF', '4', '59 on the day after Halloween', 'old enough',
       'your mom', 'I can remember when Java was a cool new language',
       '60+'], dtype=object)

Again, this is a pretty messy column of data. This is a good example of why those that create online surveys shouldn't allow the individual to just put any value into the field. But it is now our job to clean this up.

Exercise_P: Your task is to put these values into the following categorical bins: unknown, 17 and under, 18-25, 26-35, 36-45, 46-55, and 56+.

  • The category labels should exactly match the above.
  • Missing values should be replaced with the unknown category
  • To make things easier and avoid ambiguity, let's say that any value with text, even if we could determine the age, will be binned with the unknown category. For example: sixty-nine should be coded as unknown, 45-55 should be coded as unknown, 59 on the day after Halloween should be coded as unknown, etc.
  • Ensure that the category labels are unordered but reorder the categories so that 'unknown' is listed in the first position. This is not really needed but will help us grade your assignment. The categories should be listed as follows: Index(['unknown', '17 and under', '18-25', '26-35', '36-45', '46-55', '56+'], dtype='object')

First, we will replace any non-numeric value (those with text as mentioned above) with a missing value. This will allow you to turn the other values into floats so that you can bin them. Just don't forget to code the missing values as unknown when you are done. To replace the non-numeric values, run the following code:

In [61]:
# create True/False index
age_index = candy['Q3: AGE'].str.isnumeric()

# for the index, fill missing values with False
age_index = age_index.fillna(False)
In [62]:
# select Age column for only those False values from index and code as missing
candy.loc[~age_index, 'Q3: AGE'] = np.nan
In [63]:
#Viewing missing values with False indeces

candy['Q3: AGE'].isnull()
Out[63]:
1       False
2       False
3       False
4       False
5        True
        ...  
2474    False
2475    False
2476    False
2477    False
2478    False
Name: Q3: AGE, Length: 2305, dtype: bool
In [64]:
#Number of missing values in 'Q3: AGE' column
candy['Q3: AGE'].isnull().sum()
Out[64]:
60
In [65]:
#Convert to float dtype

candy['Q3: AGE'] = candy['Q3: AGE'].astype(float)
candy['Q3: AGE']
Out[65]:
1       44.0
2       49.0
3       40.0
4       23.0
5        NaN
        ... 
2474    24.0
2475    33.0
2476    26.0
2477    58.0
2478    66.0
Name: Q3: AGE, Length: 2305, dtype: float64
In [66]:
##Your task is to put these values into the following categorical bins: 
# unknown, 17 and under, 18-25, 26-35, 36-45, 46-55, and 56+
    
## We firt Create the following bins for the 'Q3: AGE' column:
# 17 and under, 18-25, 26-35, 36-45, 46-55, and 56+
    
candy.loc[:,'Q3: AGE'] = pd.cut(candy['Q3: AGE'], bins=[0, 17, 25, 35, 45, 55, 100],
                          labels=['17 and under','18-25', '26-35', '36-45', '46-55', '56+'])
In [67]:
#Check code
candy['Q3: AGE'] 
Out[67]:
1       36-45
2       46-55
3       36-45
4       18-25
5         NaN
        ...  
2474    18-25
2475    26-35
2476    26-35
2477      56+
2478      56+
Name: Q3: AGE, Length: 2305, dtype: category
Categories (6, object): ['17 and under' < '18-25' < '26-35' < '36-45' < '46-55' < '56+']
In [68]:
## add 'unknown' category in the above df(labels)

candy['Q3: AGE'] = candy['Q3: AGE'].cat.add_categories('unknown').cat.reorder_categories(['unknown', '17 and under','18-25', '26-35', '36-45', '46-55', '56+'])

# fill the NaN missing values with 'unknown'

candy.loc[:,'Q3: AGE'] = candy.loc[:,'Q3: AGE'].astype('category').fillna('unknown')
 
In [69]:
candy['Q3: AGE']
Out[69]:
1         36-45
2         46-55
3         36-45
4         18-25
5       unknown
         ...   
2474      18-25
2475      26-35
2476      26-35
2477        56+
2478        56+
Name: Q3: AGE, Length: 2305, dtype: category
Categories (7, object): ['unknown' < '17 and under' < '18-25' < '26-35' < '36-45' < '46-55' < '56+']

Exercise_Q: Double check yourself by checking the categories for the Q3: AGE column. It should output: Index(['unknown', '17 and under', '18-25', '26-35', '36-45', '46-55', '56+'], dtype='object')

In [70]:
#checking the categories for the Q3: AGE column

bins = [0, 18, 25, 35, 45, 55, 100]
labels =['unknown', '17 and under', '18-25', '26-35', '36-45', '46-55', '56+']
      
index = pd.DataFrame(candy['Q3: AGE'], labels, bins)
index.index.astype(object)
Out[70]:
Index(['unknown', '17 and under', '18-25', '26-35', '36-45', '46-55', '56+'], dtype='object')

Code Check: To double check your above binning worked correctly, your value counts (sorted by the index) should be as follows:

unknown: 60 
17 and under: 49 
18-25: 85
26-35: 520
36-45: 768
46-55: 525
56+: 298
In [71]:
#Sort values for candy['Q3: AGE'] column, as above

candy['Q3: AGE'].value_counts().sort_index(ascending = True)
Out[71]:
unknown          60
17 and under     49
18-25            85
26-35           520
36-45           768
46-55           525
56+             298
Name: Q3: AGE, dtype: int64

You can also double check some of your work up to this point by making sure that there are no missing values in the data set anymore.

In [72]:
#Checking missing value

candy['Q3: AGE'].isnull().sum()
Out[72]:
0

Code Check: Check to see if there are any missing values in the data set. Your output should show 0.

In [73]:
#Check to see if there are any missing values in the data set. 
# The output should show 0

candy.isnull().sum()
Out[73]:
Q1: GOING OUT?             0
Q2: GENDER                 0
Q3: AGE                    0
Q4: COUNTRY                0
Q6 | 100 Grand Bar         0
                          ..
Q11: DAY                   0
Q12: MEDIA [Daily Dish]    0
Q12: MEDIA [Science]       0
Q12: MEDIA [ESPN]          0
Q12: MEDIA [Yahoo]         0
Length: 113, dtype: int64

Exercise_R: Before you move on to the Data Analysis section, reset the index for candy ensuring that it goes from 0 to n-1.

In [74]:
### Reset index for candy ensuring that it goes from 0 to n-1

# We apply the (start, stop, step) method 

candy.index = pd.RangeIndex(start=0, stop=len(candy1), step=1)
candy
Out[74]:
Q1: GOING OUT? Q2: GENDER Q3: AGE Q4: COUNTRY Q6 | 100 Grand Bar Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) Q6 | Any full-sized candy bar Q6 | Black Jacks Q6 | Bonkers (the candy) Q6 | Bonkers (the board game) ... Q6 | Whatchamacallit Bars Q6 | White Bread Q6 | Whole Wheat anything Q6 | York Peppermint Patties Q10: DRESS Q11: DAY Q12: MEDIA [Daily Dish] Q12: MEDIA [Science] Q12: MEDIA [ESPN] Q12: MEDIA [Yahoo]
0 No Male 36-45 USA MEH DESPAIR JOY MEH DESPAIR DESPAIR ... DESPAIR DESPAIR DESPAIR DESPAIR White and gold Sunday 0.0 1.0 0.0 0.0
1 No Male 46-55 USA NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER ... NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER 0.0 0.0 0.0 0.0
2 No Male 36-45 USA MEH DESPAIR JOY MEH MEH DESPAIR ... JOY DESPAIR DESPAIR DESPAIR White and gold Sunday 0.0 1.0 0.0 0.0
3 No Male 18-25 USA JOY DESPAIR JOY DESPAIR MEH DESPAIR ... JOY DESPAIR DESPAIR JOY White and gold Friday 0.0 1.0 0.0 0.0
4 No Male unknown Other JOY DESPAIR JOY NO_ANSWER NO_ANSWER NO_ANSWER ... JOY DESPAIR DESPAIR JOY NO_ANSWER NO_ANSWER 0.0 1.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2300 No Male 18-25 USA JOY DESPAIR MEH DESPAIR DESPAIR MEH ... DESPAIR MEH DESPAIR MEH White and gold Friday 0.0 0.0 0.0 0.0
2301 No Female 26-35 USA MEH DESPAIR JOY NO_ANSWER NO_ANSWER NO_ANSWER ... JOY DESPAIR MEH JOY Blue and black Friday 0.0 1.0 0.0 0.0
2302 No Female 26-35 USA MEH DESPAIR JOY DESPAIR MEH JOY ... MEH DESPAIR DESPAIR MEH Blue and black Friday 0.0 1.0 0.0 0.0
2303 No Male 56+ USA NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER ... NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER 0.0 0.0 0.0 0.0
2304 No Female 56+ USA DESPAIR DESPAIR JOY DESPAIR DESPAIR DESPAIR ... DESPAIR MEH DESPAIR JOY White and gold Sunday 1.0 0.0 0.0 0.0

2305 rows × 113 columns

In [75]:
### Option 2:

#Reset index and retain old index as a column
# candy = candy.reset_index(drop=True)

# candy

I would suggest that you stop here and run your code through CodeGrade to check the previous steps before continuing. Just keep in mind that there are some preparation steps below that will be marked incorrect because you have not yet gotten to them.

Data Analysis¶

Make sure that you are ready to answer any of the following questions about the data set that may appear on your quiz. Please use the cleaned, final candy data to answer these questions. Note that the answers here may be different than any that appear in the article about this data set or that could be found using Excel. Ours has been altered and cleaned in a different way than the original authors did. Also, please do not use Excel to try to find these answers. First, you may not get the correct answer, and more importantly, we want you to practice your Pandas skills.

Exercise_S: How many rows were in the original, uncleaned data that you imported? How many rows are in the cleaned data? How many did we end up removing from the data set?

In [76]:
##There were 2479 rows in the original uncleaned data that we imported
# 2479 rows × 120 columns

candy_full.shape
Out[76]:
(2479, 120)
In [77]:
##we end up with 2305 rows, removing/cleaning from the old data set
#2305 rows × 113 columns

candy.shape
Out[77]:
(2305, 113)
In [78]:
# How many did we end up removing from the data set?  
# Rows = 2479 rows  - 2305 rows = 174 rows

pd.concat([candy_full, candy]).loc[candy_full.index.symmetric_difference(candy.index)]
Out[78]:
Internal ID Q1: GOING OUT? Q2: GENDER Q3: AGE Q4: COUNTRY Q5: STATE, PROVINCE, COUNTY, ETC Q6 | 100 Grand Bar Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) Q6 | Any full-sized candy bar Q6 | Black Jacks ... Q9: OTHER COMMENTS Q10: DRESS Unnamed: 113 Q11: DAY Q12: MEDIA [Daily Dish] Q12: MEDIA [Science] Q12: MEDIA [ESPN] Q12: MEDIA [Yahoo] Click Coordinates (x, y) Q6_new | Hershey's Milk Chocolate
2305 90291228.0 Yes Female 48 USA KS JOY MEH JOY DESPAIR ... Smarties (Commonwealth) are awesome. Sweetums ... Blue and black NaN Friday 1.0 NaN NaN NaN (31, 5) NaN
2306 90291230.0 No Female 60 United States California JOY DESPAIR JOY MEH ... NaN Blue and black NaN Friday NaN 1.0 NaN NaN (58, 6) NaN
2307 90291318.0 NaN Male 53 United States Illinois MEH DESPAIR JOY MEH ... NaN White and gold NaN Friday NaN 1.0 NaN NaN (80, 31) NaN
2308 90291319.0 No Male 37 us IL NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2309 90291458.0 Yes Male 42 USA OH JOY DESPAIR JOY MEH ... NaN White and gold NaN Friday NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2474 90314359.0 No Male 24 USA MD JOY DESPAIR MEH DESPAIR ... NaN White and gold NaN Friday NaN NaN NaN NaN NaN NaN
2475 90314580.0 No Female 33 USA New York MEH DESPAIR JOY NaN ... NaN Blue and black NaN Friday NaN 1.0 NaN NaN (70, 26) NaN
2476 90314634.0 No Female 26 USA Tennessee MEH DESPAIR JOY DESPAIR ... NaN Blue and black NaN Friday NaN 1.0 NaN NaN (67, 35) NaN
2477 90314658.0 No Male 58 Usa North Carolina NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2478 90314802.0 No Female 66 usa Pennsylvania DESPAIR DESPAIR JOY DESPAIR ... You hit all my chocolate highlights, and broug... White and gold NaN Sunday 1.0 NaN NaN NaN (19, 26) NaN

174 rows × 121 columns

Exercise_T: What percentage of respondents are planning to go out trick-or-treating? (Again, make sure that you are using the final, cleaned data for this and all the following questions.)

In [79]:
# number of respondents planning to go out

candy['Q1: GOING OUT?'].value_counts()   #-- 298 number of respondents  
Out[79]:
No     2007
Yes     298
Name: Q1: GOING OUT?, dtype: int64
In [80]:
# percentages of respondents planning to go out


print(candy['Q1: GOING OUT?'].value_counts(normalize=True).sort_index(ascending = False))

print('======================================================')

print('About 12% or 13% respondents are planning to go out')
Yes    0.129284
No     0.870716
Name: Q1: GOING OUT?, dtype: float64
======================================================
About 12% or 13% respondents are planning to go out

Exercise_U: What percentage of respondents 17 and younger are planning to go out for trick-or-treating?

In [81]:
#percentage of respondents 17 and younger 

# Answer: 0.2 (2%)

print(candy['Q3: AGE'].value_counts(normalize=True).sort_index(ascending = True))

print('==========================================================')

print('About 2% respondents 17 and younger are planning to go out')
unknown         0.026030
17 and under    0.021258
18-25           0.036876
26-35           0.225597
36-45           0.333189
46-55           0.227766
56+             0.129284
Name: Q3: AGE, dtype: float64
==========================================================
About 2% respondents 17 and younger are planning to go out

Exercise_V: What are the total value counts and the normalized percentages of the age categories from the age column?

In [82]:
#Total value counts and the normalized percentages of the age categories

print(candy['Q3: AGE'].value_counts().sum())

print('============================')

print(candy['Q3: AGE'].value_counts(normalize=True).sort_index(ascending = True))
2305
============================
unknown         0.026030
17 and under    0.021258
18-25           0.036876
26-35           0.225597
36-45           0.333189
46-55           0.227766
56+             0.129284
Name: Q3: AGE, dtype: float64
In [83]:
# We can print this for a list of column names

for col in candy.columns:
    print(col)
Q1: GOING OUT?
Q2: GENDER
Q3: AGE
Q4: COUNTRY
Q6 | 100 Grand Bar
Q6 | Anonymous brown globs that come in black and orange wrappers	(a.k.a. Mary Janes)
Q6 | Any full-sized candy bar
Q6 | Black Jacks
Q6 | Bonkers (the candy)
Q6 | Bonkers (the board game)
Q6 | Bottle Caps
Q6 | Box'o'Raisins
Q6 | Broken glow stick
Q6 | Butterfinger
Q6 | Cadbury Creme Eggs
Q6 | Candy Corn
Q6 | Candy that is clearly just the stuff given out for free at restaurants
Q6 | Caramellos
Q6 | Cash, or other forms of legal tender
Q6 | Chardonnay
Q6 | Chick-o-Sticks (we donÕt know what that is)
Q6 | Chiclets
Q6 | Coffee Crisp
Q6 | Creepy Religious comics/Chick Tracts
Q6 | Dental paraphenalia
Q6 | Dots
Q6 | Dove Bars
Q6 | Fuzzy Peaches
Q6 | Generic Brand Acetaminophen
Q6 | Glow sticks
Q6 | Goo Goo Clusters
Q6 | Good N' Plenty
Q6 | Gum from baseball cards
Q6 | Gummy Bears straight up
Q6 | Hard Candy
Q6 | Healthy Fruit
Q6 | Heath Bar
Q6 | Hershey's Dark Chocolate
Q6_new | Hershey's Milk Chocolate
Q6 | Hershey's Kisses
Q6 | Hugs (actual physical hugs)
Q6 | Jolly Rancher (bad flavor)
Q6 | Jolly Ranchers (good flavor)
Q6 | JoyJoy (Mit Iodine!)
Q6 | Junior Mints
Q6 | Senior Mints
Q6 | Kale smoothie
Q6 | Kinder Happy Hippo
Q6 | Kit Kat
Q6 | LaffyTaffy
Q6 | LemonHeads
Q6 | Licorice (not black)
Q6 | Licorice (yes black)
Q6 | Lindt Truffle
Q6 | Lollipops
Q6 | Mars
Q6 | Maynards
Q6 | Mike and Ike
Q6 | Milk Duds
Q6 | Milky Way
Q6 | Regular M&Ms
Q6 | Peanut M&MÕs
Q6 | Blue M&M's
Q6 | Red M&M's
Q6 | Green Party M&M's
Q6 | Independent M&M's
Q6 | Abstained from M&M'ing.
Q6 | Minibags of chips
Q6 | Mint Kisses
Q6 | Mint Juleps
Q6 | Mr. Goodbar
Q6 | Necco Wafers
Q6 | Nerds
Q6 | Nestle Crunch
Q6 | Now'n'Laters
Q6 | Peeps
Q6 | Pencils
Q6 | Pixy Stix
Q6 | Real Housewives of Orange County Season 9 Blue-Ray
Q6 | ReeseÕs Peanut Butter Cups
Q6 | Reese's Pieces
Q6 | Reggie Jackson Bar
Q6 | Rolos
Q6 | Sandwich-sized bags filled with BooBerry Crunch
Q6 | Skittles
Q6 | Smarties (American)
Q6 | Smarties (Commonwealth)
Q6 | Snickers
Q6 | Sourpatch Kids (i.e. abominations of nature)
Q6 | Spotted Dick
Q6 | Starburst
Q6 | Sweet Tarts
Q6 | Swedish Fish
Q6 | Sweetums (a friend to diabetes)
Q6 | Take 5
Q6 | Tic Tacs
Q6 | Those odd marshmallow circus peanut things
Q6 | Three Musketeers
Q6 | Tolberone something or other
Q6 | Trail Mix
Q6 | Twix
Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein
Q6 | Vicodin
Q6 | Whatchamacallit Bars
Q6 | White Bread
Q6 | Whole Wheat anything
Q6 | York Peppermint Patties
Q10: DRESS
Q11: DAY
Q12: MEDIA [Daily Dish]
Q12: MEDIA [Science]
Q12: MEDIA [ESPN]
Q12: MEDIA [Yahoo]

Exercise_W: What are the total counts and percentages for the gender column?

In [84]:
#Total counts for the gender column
##Total percentages for the gender column

print(candy['Q2: GENDER'].value_counts().sum())

print('========================')

print(candy['Q2: GENDER'].value_counts(normalize=True).sort_index(ascending = True))
2305
========================
Female    0.363991
Male      0.636009
Name: Q2: GENDER, dtype: float64

Exercise_X: What are the breakdown of counts for the country column?

In [85]:
# Breakdown of counts for the country column

candy['Q4: COUNTRY'].value_counts()
Out[85]:
USA      1955
CA        216
EU         73
Other      61
Name: Q4: COUNTRY, dtype: int64

Exercise_Y: How many total respondents voted joy in candy corn and how many voted despair? Did more people vote joy or despair for candy corn?

In [86]:
#Total vote respondents for JOY and DESPAIR in candy corn 
#Total percentages of vote respondents

print(candy['Q6 | Candy Corn'].value_counts())

print('=================================')

print(candy['Q6 | Candy Corn'].value_counts(normalize=True).round(2))
DESPAIR      702
NO_ANSWER    620
MEH          529
JOY          454
Name: Q6 | Candy Corn, dtype: int64
=================================
DESPAIR      0.30
NO_ANSWER    0.27
MEH          0.23
JOY          0.20
Name: Q6 | Candy Corn, dtype: float64
In [87]:
#Total percentages of vote respondents for JOY and DESPAIR in candy corn 

print(candy['Q6 | Candy Corn'].value_counts(normalize=True))

print('=================================================')

print('More people voted for DESPAIR in candy corn = 30% ') 
DESPAIR      0.304555
NO_ANSWER    0.268980
MEH          0.229501
JOY          0.196963
Name: Q6 | Candy Corn, dtype: float64
=================================================
More people voted for DESPAIR in candy corn = 30% 

Exercise_Z: How many people voted joy in Reese's Peanut Butter Cups? In Snickers? Did more people vote joy for Reese's Peanut Butter Cups or for Snickers?

In [88]:
# candy['Q6 | Reese\'s Peanut Butter Cups'].value_counts()    #check Reese\'s Peanut Butter Cups spelling
In [89]:
candy['Q6 | Snickers'].value_counts()  
Out[89]:
JOY          1325
NO_ANSWER     633
MEH           273
DESPAIR        74
Name: Q6 | Snickers, dtype: int64

Exercise_AA: How many people voted joy in Twix? In Kit Kats? Did more people vote joy for Twix or for Kit Kats?

In [90]:
candy['Q6 | Twix'].value_counts()  
Out[90]:
JOY          1339
NO_ANSWER     630
MEH           269
DESPAIR        67
Name: Q6 | Twix, dtype: int64
In [91]:
candy['Q6 | Kit Kat'].value_counts()  
Out[91]:
JOY          1367
NO_ANSWER     616
MEH           275
DESPAIR        47
Name: Q6 | Kit Kat, dtype: int64

Exercise_AB: How many people voted joy in white bread? For whole wheat items? Did more people vote joy for white bread or whole wheat items?

In [92]:
print(candy['Q6 | White Bread'].value_counts())  
DESPAIR      1376
NO_ANSWER     694
MEH           192
JOY            43
Name: Q6 | White Bread, dtype: int64
In [93]:
print(candy['Q6 | Whole Wheat anything'].value_counts())  

print('===============================================')

print('More people voted joy on whole wheat items')
DESPAIR      1233
NO_ANSWER     684
MEH           278
JOY           110
Name: Q6 | Whole Wheat anything, dtype: int64
===============================================
More people voted joy on whole wheat items

Exercise_AC: How many people voted joy for Bonkers the board game? For Bonkers the candy? Did more people vote joy for the board game or for the candy?

In [94]:
candy['Q6 | Bonkers (the board game)'].value_counts()  
Out[94]:
NO_ANSWER    926
MEH          673
DESPAIR      518
JOY          188
Name: Q6 | Bonkers (the board game), dtype: int64
In [95]:
print(candy['Q6 | Bonkers (the candy)'].value_counts()) 

print('===============================================')

print('More people voted JOY for the board game than the candy = 188')
NO_ANSWER    911
MEH          818
DESPAIR      467
JOY          109
Name: Q6 | Bonkers (the candy), dtype: int64
===============================================
More people voted JOY for the board game than the candy = 188

Exercise_AD: How many people voted joy for a box of raisins? For the Blue-Ray DVD of the Real Housewives of Orange County Season 9? Did more people vote joy for a box of raisins or for the DVD?

In [96]:
# candy['Box Of Raisins'].value_counts()  
In [97]:
candy['Q6 | Candy Corn'].value_counts()  
Out[97]:
DESPAIR      702
NO_ANSWER    620
MEH          529
JOY          454
Name: Q6 | Candy Corn, dtype: int64

Exercise_AE: What is the favorite day of the week for the respondents (both by total counts and percentages)?

In [98]:
# favorite day of the week by total counts

print(candy['Q11: DAY'].value_counts())
Friday       1026
NO_ANSWER     658
Sunday        621
Name: Q11: DAY, dtype: int64
In [99]:
# favorite day of the week by percentage

print(candy['Q11: DAY'].value_counts(normalize=True).round(2))

print('===============================================')

print('Friday is the favorite day of the week for the respondents, both by total counts = 1026, and percentages = 45% )')
Friday       0.45
NO_ANSWER    0.29
Sunday       0.27
Name: Q11: DAY, dtype: float64
===============================================
Friday is the favorite day of the week for the respondents, both by total counts = 1026, and percentages = 45% )

Exercise_AF: Do more respondents see 'white and gold' or 'blue and black' for the color of the dress (both total counts and percentages)?

In [100]:
# color of the dress (total counts)

print(candy['Q10: DRESS'].value_counts())
White and gold    1027
NO_ANSWER          679
Blue and black     599
Name: Q10: DRESS, dtype: int64
In [101]:
# color of the dress (percentages)

print(candy['Q10: DRESS'].value_counts(normalize=True).round(2))

print('===========================================================================')  

print('More respondents see White and gold for the color of the dress (1027 = 45%)')
White and gold    0.45
NO_ANSWER         0.29
Blue and black    0.26
Name: Q10: DRESS, dtype: float64
===========================================================================
More respondents see White and gold for the color of the dress (1027 = 45%)

Exercise_AG: For those respondents that clicked on the media link (listed as Q12 columns on the survey), which link did they click on the most?

In [102]:
print(candy['Q12: MEDIA [Yahoo]'].value_counts())
0.0    2244
1.0      61
Name: Q12: MEDIA [Yahoo], dtype: int64

Exercise_AH: How many males voted joy for receiving a copy of the Real Housewives of Orange County Season 9 DVD for Halloween? Females? Those 17 or younger?

In [103]:
# number of Males and Females voted joy

print(candy[['Q2: GENDER', 'Q6 | Real Housewives of Orange County Season 9 Blue-Ray']].value_counts())

print('====================================')  

print('45 males and 41 Females voted joy')
Q2: GENDER  Q6 | Real Housewives of Orange County Season 9 Blue-Ray
Male        DESPAIR                                                    901
Female      DESPAIR                                                    497
Male        NO_ANSWER                                                  439
Female      NO_ANSWER                                                  253
Male        MEH                                                         81
Female      MEH                                                         48
Male        JOY                                                         45
Female      JOY                                                         41
dtype: int64
====================================
45 males and 41 Females voted joy
In [104]:
# number of respondents 17 or younger who voted joy

print(candy[['Q3: AGE','Q6 | Real Housewives of Orange County Season 9 Blue-Ray']].value_counts())

print('====================================')  

print('2 respondents 17 or younger voted joy')
Q3: AGE       Q6 | Real Housewives of Orange County Season 9 Blue-Ray
36-45         DESPAIR                                                    482
46-55         DESPAIR                                                    377
26-35         DESPAIR                                                    273
36-45         NO_ANSWER                                                  224
26-35         NO_ANSWER                                                  187
56+           DESPAIR                                                    177
46-55         NO_ANSWER                                                  109
56+           NO_ANSWER                                                   91
36-45         MEH                                                         38
18-25         NO_ANSWER                                                   35
              DESPAIR                                                     35
26-35         MEH                                                         33
unknown       NO_ANSWER                                                   29
26-35         JOY                                                         27
17 and under  DESPAIR                                                     27
unknown       DESPAIR                                                     27
46-55         MEH                                                         26
36-45         JOY                                                         24
56+           MEH                                                         18
17 and under  NO_ANSWER                                                   17
46-55         JOY                                                         13
56+           JOY                                                         12
18-25         MEH                                                         10
              JOY                                                          5
unknown       JOY                                                          3
17 and under  MEH                                                          3
              JOY                                                          2
unknown       MEH                                                          1
dtype: int64
====================================
2 respondents 17 or younger voted joy

Exercise_AI: The authors tried a funny way to determine a respondent's political leaning. Note this was purely a joke and not meant to be scientific. How many total respondents voted joy in the following: Blue M&M's, Red M&M's, Green Party M&M's, Independent M&M's, and Abstained from M&M'ing.

In [105]:
## number of total respondents voted joy in the following: 
   #Blue M&M's, Red M&M's, Green Party M&M's, Independent M&M's, and Abstained from M&M'ing

print(pd.DataFrame(candy, columns = ['Q6 | Blue M&M\'s']).value_counts())
print('==============================================') 
print('963 total respondents voted joy in Blue M&M\'s')
print('==============================================')
print(pd.DataFrame(candy, columns = ['Q6 | Red M&M\'s']).value_counts())
print('==============================================') 
print('949 total respondents voted joy in Red M&M\'s')
print('==============================================') 
print(pd.DataFrame(candy, columns = ['Q6 | Green Party M&M\'s']).value_counts())
print('=====================================================') 
print('915 total respondents voted joy in Green Party M&M\'s')
print('=====================================================') 
print(pd.DataFrame(candy, columns = ['Q6 | Independent M&M\'s']).value_counts())
print('======================================================') 
print('711 total respondents voted joy in Independent M&M\'s')
print('======================================================') 
print(pd.DataFrame(candy, columns = ['Q6 | Abstained from M&M\'ing.']).value_counts())
print('==========================================================') 
print('202 total respondents voted joy in Abstained from M&M\'ing.')
Q6 | Blue M&M's
JOY                963
NO_ANSWER          667
MEH                563
DESPAIR            112
dtype: int64
==============================================
963 total respondents voted joy in Blue M&M's
==============================================
Q6 | Red M&M's
JOY               949
NO_ANSWER         669
MEH               574
DESPAIR           113
dtype: int64
==============================================
949 total respondents voted joy in Red M&M's
==============================================
Q6 | Green Party M&M's
JOY                       915
NO_ANSWER                 701
MEH                       561
DESPAIR                   128
dtype: int64
=====================================================
915 total respondents voted joy in Green Party M&M's
=====================================================
Q6 | Independent M&M's
NO_ANSWER                 749
JOY                       711
MEH                       592
DESPAIR                   253
dtype: int64
======================================================
711 total respondents voted joy in Independent M&M's
======================================================
Q6 | Abstained from M&M'ing.
NO_ANSWER                       868
DESPAIR                         662
MEH                             573
JOY                             202
dtype: int64
==========================================================
202 total respondents voted joy in Abstained from M&M'ing.
In [106]:
# We can print this for a list of column names

for col in candy.columns:
    print(col)
Q1: GOING OUT?
Q2: GENDER
Q3: AGE
Q4: COUNTRY
Q6 | 100 Grand Bar
Q6 | Anonymous brown globs that come in black and orange wrappers	(a.k.a. Mary Janes)
Q6 | Any full-sized candy bar
Q6 | Black Jacks
Q6 | Bonkers (the candy)
Q6 | Bonkers (the board game)
Q6 | Bottle Caps
Q6 | Box'o'Raisins
Q6 | Broken glow stick
Q6 | Butterfinger
Q6 | Cadbury Creme Eggs
Q6 | Candy Corn
Q6 | Candy that is clearly just the stuff given out for free at restaurants
Q6 | Caramellos
Q6 | Cash, or other forms of legal tender
Q6 | Chardonnay
Q6 | Chick-o-Sticks (we donÕt know what that is)
Q6 | Chiclets
Q6 | Coffee Crisp
Q6 | Creepy Religious comics/Chick Tracts
Q6 | Dental paraphenalia
Q6 | Dots
Q6 | Dove Bars
Q6 | Fuzzy Peaches
Q6 | Generic Brand Acetaminophen
Q6 | Glow sticks
Q6 | Goo Goo Clusters
Q6 | Good N' Plenty
Q6 | Gum from baseball cards
Q6 | Gummy Bears straight up
Q6 | Hard Candy
Q6 | Healthy Fruit
Q6 | Heath Bar
Q6 | Hershey's Dark Chocolate
Q6_new | Hershey's Milk Chocolate
Q6 | Hershey's Kisses
Q6 | Hugs (actual physical hugs)
Q6 | Jolly Rancher (bad flavor)
Q6 | Jolly Ranchers (good flavor)
Q6 | JoyJoy (Mit Iodine!)
Q6 | Junior Mints
Q6 | Senior Mints
Q6 | Kale smoothie
Q6 | Kinder Happy Hippo
Q6 | Kit Kat
Q6 | LaffyTaffy
Q6 | LemonHeads
Q6 | Licorice (not black)
Q6 | Licorice (yes black)
Q6 | Lindt Truffle
Q6 | Lollipops
Q6 | Mars
Q6 | Maynards
Q6 | Mike and Ike
Q6 | Milk Duds
Q6 | Milky Way
Q6 | Regular M&Ms
Q6 | Peanut M&MÕs
Q6 | Blue M&M's
Q6 | Red M&M's
Q6 | Green Party M&M's
Q6 | Independent M&M's
Q6 | Abstained from M&M'ing.
Q6 | Minibags of chips
Q6 | Mint Kisses
Q6 | Mint Juleps
Q6 | Mr. Goodbar
Q6 | Necco Wafers
Q6 | Nerds
Q6 | Nestle Crunch
Q6 | Now'n'Laters
Q6 | Peeps
Q6 | Pencils
Q6 | Pixy Stix
Q6 | Real Housewives of Orange County Season 9 Blue-Ray
Q6 | ReeseÕs Peanut Butter Cups
Q6 | Reese's Pieces
Q6 | Reggie Jackson Bar
Q6 | Rolos
Q6 | Sandwich-sized bags filled with BooBerry Crunch
Q6 | Skittles
Q6 | Smarties (American)
Q6 | Smarties (Commonwealth)
Q6 | Snickers
Q6 | Sourpatch Kids (i.e. abominations of nature)
Q6 | Spotted Dick
Q6 | Starburst
Q6 | Sweet Tarts
Q6 | Swedish Fish
Q6 | Sweetums (a friend to diabetes)
Q6 | Take 5
Q6 | Tic Tacs
Q6 | Those odd marshmallow circus peanut things
Q6 | Three Musketeers
Q6 | Tolberone something or other
Q6 | Trail Mix
Q6 | Twix
Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein
Q6 | Vicodin
Q6 | Whatchamacallit Bars
Q6 | White Bread
Q6 | Whole Wheat anything
Q6 | York Peppermint Patties
Q10: DRESS
Q11: DAY
Q12: MEDIA [Daily Dish]
Q12: MEDIA [Science]
Q12: MEDIA [ESPN]
Q12: MEDIA [Yahoo]

Exercise_AJ: Select only the Q6 candy columns (Q6 | 100 Grand Bar through Q6 | York Peppermint Patties) in the data set and save this as a new DataFrame called candy_reduced.

In [107]:
#Select Q6 candy columns (Q6 | 100 Grand Bar through Q6 | York Peppermint Patties) 
# And save this as a new DataFrame called candy_reduced

candy_reduced = candy.loc[:, 'Q6 | 100 Grand Bar':'Q6 | York Peppermint Patties']
candy_reduced
Out[107]:
Q6 | 100 Grand Bar Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) Q6 | Any full-sized candy bar Q6 | Black Jacks Q6 | Bonkers (the candy) Q6 | Bonkers (the board game) Q6 | Bottle Caps Q6 | Box'o'Raisins Q6 | Broken glow stick Q6 | Butterfinger ... Q6 | Three Musketeers Q6 | Tolberone something or other Q6 | Trail Mix Q6 | Twix Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein Q6 | Vicodin Q6 | Whatchamacallit Bars Q6 | White Bread Q6 | Whole Wheat anything Q6 | York Peppermint Patties
0 MEH DESPAIR JOY MEH DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR ... JOY JOY DESPAIR JOY DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR
1 NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER ... NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER
2 MEH DESPAIR JOY MEH MEH DESPAIR MEH DESPAIR DESPAIR MEH ... DESPAIR JOY MEH JOY DESPAIR JOY JOY DESPAIR DESPAIR DESPAIR
3 JOY DESPAIR JOY DESPAIR MEH DESPAIR MEH DESPAIR DESPAIR MEH ... JOY JOY DESPAIR JOY MEH JOY JOY DESPAIR DESPAIR JOY
4 JOY DESPAIR JOY NO_ANSWER NO_ANSWER NO_ANSWER MEH MEH DESPAIR JOY ... JOY JOY MEH JOY DESPAIR DESPAIR JOY DESPAIR DESPAIR JOY
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2300 JOY DESPAIR MEH DESPAIR DESPAIR MEH MEH DESPAIR DESPAIR MEH ... MEH MEH JOY JOY MEH JOY DESPAIR MEH DESPAIR MEH
2301 MEH DESPAIR JOY NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER DESPAIR DESPAIR JOY ... MEH MEH DESPAIR JOY NO_ANSWER NO_ANSWER JOY DESPAIR MEH JOY
2302 MEH DESPAIR JOY DESPAIR MEH JOY DESPAIR MEH MEH DESPAIR ... JOY JOY MEH MEH MEH JOY MEH DESPAIR DESPAIR MEH
2303 NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER ... NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER
2304 DESPAIR DESPAIR JOY DESPAIR DESPAIR DESPAIR DESPAIR MEH DESPAIR JOY ... JOY MEH DESPAIR DESPAIR DESPAIR JOY DESPAIR MEH DESPAIR JOY

2305 rows × 103 columns

Exercise_AK: Determine what candy/item from the candy_reduced DataFrame has the most number of JOY votes and the least number of JOY votes. A simple way to do this is to filter the entire DataFrame for any JOY values, then use count(), then sort the values in descending order. See this stackoverflow question and answers.

In [108]:
candy_reduced[candy_reduced == 'JOY'].count().sort_values(ascending=False)
Out[108]:
Q6 | Any full-sized candy bar                                                  1477
Q6 | ReeseÕs Peanut Butter Cups                                                1416
Q6 | Kit Kat                                                                   1367
Q6 | Cash, or other forms of legal tender                                      1363
Q6 | Twix                                                                      1339
                                                                               ... 
Q6 | JoyJoy (Mit Iodine!)                                                        72
Q6 | Gum from baseball cards                                                     43
Q6 | White Bread                                                                 43
Q6 | Candy that is clearly just the stuff given out for free at restaurants      37
Q6 | Broken glow stick                                                           24
Length: 103, dtype: int64

Exercise_AL: Using the above as an example, what candy/item has the most DESPAIR votes?

In [109]:
candy_reduced[candy_reduced == 'DESPAIR'].count().sort_values(ascending=False)
Out[109]:
Q6 | Broken glow stick                                     1535
Q6 | Real Housewives of Orange County Season 9 Blue-Ray    1398
Q6 | Gum from baseball cards                               1386
Q6 | White Bread                                           1376
Q6 | Kale smoothie                                         1365
                                                           ... 
Q6 | Regular M&Ms                                            71
Q6 | Twix                                                    67
Q6 | Cash, or other forms of legal tender                    63
Q6 | Kit Kat                                                 47
Q6 | Any full-sized candy bar                                15
Length: 103, dtype: int64

Exercise_AM: What candy/item has the most MEH votes?

In [110]:
candy_reduced[candy_reduced == 'MEH'].count().sort_values(ascending=False)
Out[110]:
Q6 | Lollipops                                             877
Q6 | Hard Candy                                            855
Q6 | Bonkers (the candy)                                   818
Q6 | Minibags of chips                                     718
Q6 | 100 Grand Bar                                         715
                                                          ... 
Q6 | ReeseÕs Peanut Butter Cups                            178
Q6 | Kale smoothie                                         162
Q6 | Real Housewives of Orange County Season 9 Blue-Ray    129
Q6 | Broken glow stick                                      99
Q6 | Creepy Religious comics/Chick Tracts                   95
Length: 103, dtype: int64

Exercise_AN: What candy/item did the most people not recognize or have an opinion on? (According to the survey, the respondents were asked to leave a question blank if they did not know the item)

In [111]:
candy_reduced[candy_reduced == 'NO_ANSWER'].count().sort_values(ascending=False)
Out[111]:
Q6 | JoyJoy (Mit Iodine!)               942
Q6 | Maynards                           939
Q6 | Reggie Jackson Bar                 933
Q6 | Bonkers (the board game)           926
Q6 | Sweetums (a friend to diabetes)    924
                                       ... 
Q6 | Kit Kat                            616
Q6 | Any full-sized candy bar           615
Q6 | Hershey's Dark Chocolate           615
Q6_new | Hershey's Milk Chocolate       614
Q6 | Peanut M&MÕs                       614
Length: 103, dtype: int64

In the final piece of the analysis, we will determine what candy/items have the most and lowest "net_feelies" (calculated by the authors as the total joy count minus the total despair count).

First, we will create two Series, one with JOY counts and one with DESPAIR counts to add to our candy_reduced data.

Exercise_AO: Create a Series called joy_count that lists total counts for JOY for each column, making sure to keep it in the same order as the columns in the candy_reduced DataFrame. Hint: This should be almost exactly how we determined which candy/items had the most JOY votes, but we would not do any sorting.

In [112]:
joy_count = pd.Series(candy_reduced[candy_reduced == 'JOY'].count())
joy_count

# candy_reduced[candy_reduced == 'JOY'].count()
Out[112]:
Q6 | 100 Grand Bar                                                                         828
Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes)     165
Q6 | Any full-sized candy bar                                                             1477
Q6 | Black Jacks                                                                            87
Q6 | Bonkers (the candy)                                                                   109
                                                                                          ... 
Q6 | Vicodin                                                                               674
Q6 | Whatchamacallit Bars                                                                  802
Q6 | White Bread                                                                            43
Q6 | Whole Wheat anything                                                                  110
Q6 | York Peppermint Patties                                                              1040
Length: 103, dtype: int64

Exercise_AP: Same as above except you will create a Series called despair_count that lists the total counts for DESPAIR for each column.

In [113]:
despair_count = pd.Series(candy_reduced[candy_reduced == 'DESPAIR'].count())
despair_count
Out[113]:
Q6 | 100 Grand Bar                                                                          80
Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes)    1017
Q6 | Any full-sized candy bar                                                               15
Q6 | Black Jacks                                                                           754
Q6 | Bonkers (the candy)                                                                   467
                                                                                          ... 
Q6 | Vicodin                                                                               679
Q6 | Whatchamacallit Bars                                                                  273
Q6 | White Bread                                                                          1376
Q6 | Whole Wheat anything                                                                 1233
Q6 | York Peppermint Patties                                                               222
Length: 103, dtype: int64

Exercise_AQ: Take the transpose of the candy_reduced DataFrame and save this transposed data as candy_reduced_transpose.

In [114]:
candy_reduced_transpose = candy_reduced.T
candy_reduced_transpose = candy_reduced_transpose.reset_index()
candy_reduced_transpose
Out[114]:
index 0 1 2 3 4 5 6 7 8 ... 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304
0 Q6 | 100 Grand Bar MEH NO_ANSWER MEH JOY JOY NO_ANSWER JOY JOY MEH ... NO_ANSWER NO_ANSWER JOY MEH JOY JOY MEH MEH NO_ANSWER DESPAIR
1 Q6 | Anonymous brown globs that come in black ... DESPAIR NO_ANSWER DESPAIR DESPAIR DESPAIR NO_ANSWER DESPAIR DESPAIR DESPAIR ... NO_ANSWER NO_ANSWER DESPAIR DESPAIR MEH DESPAIR DESPAIR DESPAIR NO_ANSWER DESPAIR
2 Q6 | Any full-sized candy bar JOY NO_ANSWER JOY JOY JOY NO_ANSWER JOY JOY JOY ... NO_ANSWER NO_ANSWER JOY JOY JOY MEH JOY JOY NO_ANSWER JOY
3 Q6 | Black Jacks MEH NO_ANSWER MEH DESPAIR NO_ANSWER NO_ANSWER DESPAIR MEH MEH ... NO_ANSWER NO_ANSWER DESPAIR DESPAIR DESPAIR DESPAIR NO_ANSWER DESPAIR NO_ANSWER DESPAIR
4 Q6 | Bonkers (the candy) DESPAIR NO_ANSWER MEH MEH NO_ANSWER NO_ANSWER DESPAIR MEH MEH ... NO_ANSWER NO_ANSWER MEH MEH MEH DESPAIR NO_ANSWER MEH NO_ANSWER DESPAIR
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
98 Q6 | Vicodin DESPAIR NO_ANSWER JOY JOY DESPAIR NO_ANSWER MEH DESPAIR DESPAIR ... NO_ANSWER NO_ANSWER JOY DESPAIR MEH JOY NO_ANSWER JOY NO_ANSWER JOY
99 Q6 | Whatchamacallit Bars DESPAIR NO_ANSWER JOY JOY JOY NO_ANSWER DESPAIR MEH MEH ... NO_ANSWER NO_ANSWER MEH NO_ANSWER JOY DESPAIR JOY MEH NO_ANSWER DESPAIR
100 Q6 | White Bread DESPAIR NO_ANSWER DESPAIR DESPAIR DESPAIR NO_ANSWER DESPAIR DESPAIR DESPAIR ... NO_ANSWER NO_ANSWER DESPAIR DESPAIR MEH MEH DESPAIR DESPAIR NO_ANSWER MEH
101 Q6 | Whole Wheat anything DESPAIR NO_ANSWER DESPAIR DESPAIR DESPAIR NO_ANSWER DESPAIR DESPAIR DESPAIR ... NO_ANSWER NO_ANSWER MEH DESPAIR MEH DESPAIR MEH DESPAIR NO_ANSWER DESPAIR
102 Q6 | York Peppermint Patties DESPAIR NO_ANSWER DESPAIR JOY JOY NO_ANSWER DESPAIR DESPAIR MEH ... NO_ANSWER NO_ANSWER MEH JOY JOY MEH JOY MEH NO_ANSWER JOY

103 rows × 2306 columns

Exercise_AR: Add a new column called "joy_count" using the joy_count Series above and a new column called 'despair_count" using the despair_count Series above to the candy_reduced_transpose DataFrame.

In [115]:
## Add a new column called "joy_count" using the `joy_count` Series above to the candy_reduced_transpose DataFrame
candy_reduced_transpose['joy_count'] = candy_reduced_transpose['index'].map(joy_count)
candy_reduced_transpose
Out[115]:
index 0 1 2 3 4 5 6 7 8 ... 2296 2297 2298 2299 2300 2301 2302 2303 2304 joy_count
0 Q6 | 100 Grand Bar MEH NO_ANSWER MEH JOY JOY NO_ANSWER JOY JOY MEH ... NO_ANSWER JOY MEH JOY JOY MEH MEH NO_ANSWER DESPAIR 828
1 Q6 | Anonymous brown globs that come in black ... DESPAIR NO_ANSWER DESPAIR DESPAIR DESPAIR NO_ANSWER DESPAIR DESPAIR DESPAIR ... NO_ANSWER DESPAIR DESPAIR MEH DESPAIR DESPAIR DESPAIR NO_ANSWER DESPAIR 165
2 Q6 | Any full-sized candy bar JOY NO_ANSWER JOY JOY JOY NO_ANSWER JOY JOY JOY ... NO_ANSWER JOY JOY JOY MEH JOY JOY NO_ANSWER JOY 1477
3 Q6 | Black Jacks MEH NO_ANSWER MEH DESPAIR NO_ANSWER NO_ANSWER DESPAIR MEH MEH ... NO_ANSWER DESPAIR DESPAIR DESPAIR DESPAIR NO_ANSWER DESPAIR NO_ANSWER DESPAIR 87
4 Q6 | Bonkers (the candy) DESPAIR NO_ANSWER MEH MEH NO_ANSWER NO_ANSWER DESPAIR MEH MEH ... NO_ANSWER MEH MEH MEH DESPAIR NO_ANSWER MEH NO_ANSWER DESPAIR 109
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
98 Q6 | Vicodin DESPAIR NO_ANSWER JOY JOY DESPAIR NO_ANSWER MEH DESPAIR DESPAIR ... NO_ANSWER JOY DESPAIR MEH JOY NO_ANSWER JOY NO_ANSWER JOY 674
99 Q6 | Whatchamacallit Bars DESPAIR NO_ANSWER JOY JOY JOY NO_ANSWER DESPAIR MEH MEH ... NO_ANSWER MEH NO_ANSWER JOY DESPAIR JOY MEH NO_ANSWER DESPAIR 802
100 Q6 | White Bread DESPAIR NO_ANSWER DESPAIR DESPAIR DESPAIR NO_ANSWER DESPAIR DESPAIR DESPAIR ... NO_ANSWER DESPAIR DESPAIR MEH MEH DESPAIR DESPAIR NO_ANSWER MEH 43
101 Q6 | Whole Wheat anything DESPAIR NO_ANSWER DESPAIR DESPAIR DESPAIR NO_ANSWER DESPAIR DESPAIR DESPAIR ... NO_ANSWER MEH DESPAIR MEH DESPAIR MEH DESPAIR NO_ANSWER DESPAIR 110
102 Q6 | York Peppermint Patties DESPAIR NO_ANSWER DESPAIR JOY JOY NO_ANSWER DESPAIR DESPAIR MEH ... NO_ANSWER MEH JOY JOY MEH JOY MEH NO_ANSWER JOY 1040

103 rows × 2307 columns

In [116]:
#check joy_count sum 
joy_count.sum()
Out[116]:
60771
In [117]:
## Add a new column called 'despair_count" using the despair_count Series above to the candy_reduced_transpose DataFrame.

candy_reduced_transpose['despair_count'] = candy_reduced_transpose['index'].map(despair_count)
candy_reduced_transpose
Out[117]:
index 0 1 2 3 4 5 6 7 8 ... 2297 2298 2299 2300 2301 2302 2303 2304 joy_count despair_count
0 Q6 | 100 Grand Bar MEH NO_ANSWER MEH JOY JOY NO_ANSWER JOY JOY MEH ... JOY MEH JOY JOY MEH MEH NO_ANSWER DESPAIR 828 80
1 Q6 | Anonymous brown globs that come in black ... DESPAIR NO_ANSWER DESPAIR DESPAIR DESPAIR NO_ANSWER DESPAIR DESPAIR DESPAIR ... DESPAIR DESPAIR MEH DESPAIR DESPAIR DESPAIR NO_ANSWER DESPAIR 165 1017
2 Q6 | Any full-sized candy bar JOY NO_ANSWER JOY JOY JOY NO_ANSWER JOY JOY JOY ... JOY JOY JOY MEH JOY JOY NO_ANSWER JOY 1477 15
3 Q6 | Black Jacks MEH NO_ANSWER MEH DESPAIR NO_ANSWER NO_ANSWER DESPAIR MEH MEH ... DESPAIR DESPAIR DESPAIR DESPAIR NO_ANSWER DESPAIR NO_ANSWER DESPAIR 87 754
4 Q6 | Bonkers (the candy) DESPAIR NO_ANSWER MEH MEH NO_ANSWER NO_ANSWER DESPAIR MEH MEH ... MEH MEH MEH DESPAIR NO_ANSWER MEH NO_ANSWER DESPAIR 109 467
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
98 Q6 | Vicodin DESPAIR NO_ANSWER JOY JOY DESPAIR NO_ANSWER MEH DESPAIR DESPAIR ... JOY DESPAIR MEH JOY NO_ANSWER JOY NO_ANSWER JOY 674 679
99 Q6 | Whatchamacallit Bars DESPAIR NO_ANSWER JOY JOY JOY NO_ANSWER DESPAIR MEH MEH ... MEH NO_ANSWER JOY DESPAIR JOY MEH NO_ANSWER DESPAIR 802 273
100 Q6 | White Bread DESPAIR NO_ANSWER DESPAIR DESPAIR DESPAIR NO_ANSWER DESPAIR DESPAIR DESPAIR ... DESPAIR DESPAIR MEH MEH DESPAIR DESPAIR NO_ANSWER MEH 43 1376
101 Q6 | Whole Wheat anything DESPAIR NO_ANSWER DESPAIR DESPAIR DESPAIR NO_ANSWER DESPAIR DESPAIR DESPAIR ... MEH DESPAIR MEH DESPAIR MEH DESPAIR NO_ANSWER DESPAIR 110 1233
102 Q6 | York Peppermint Patties DESPAIR NO_ANSWER DESPAIR JOY JOY NO_ANSWER DESPAIR DESPAIR MEH ... MEH JOY JOY MEH JOY MEH NO_ANSWER JOY 1040 222

103 rows × 2308 columns

In [118]:
#check despair_count sum
despair_count.sum()
Out[118]:
55169

Exercise_AS: Add a new column to the candy_reduced_transpose DataFrame called "net_feelies" that takes the joy_count column and subtracts the despair_count column.

In [119]:
# Add a new column to the candy_reduced_transpose DataFrame called "net_feelies" 
# that takes the joy_count column and subtracts the despair_count column

candy_reduced_transpose['net_feelies'] = candy_reduced_transpose['joy_count']-candy_reduced_transpose['despair_count']                           ### ENTER CODE HERE ###
candy_reduced_transpose
Out[119]:
index 0 1 2 3 4 5 6 7 8 ... 2298 2299 2300 2301 2302 2303 2304 joy_count despair_count net_feelies
0 Q6 | 100 Grand Bar MEH NO_ANSWER MEH JOY JOY NO_ANSWER JOY JOY MEH ... MEH JOY JOY MEH MEH NO_ANSWER DESPAIR 828 80 748
1 Q6 | Anonymous brown globs that come in black ... DESPAIR NO_ANSWER DESPAIR DESPAIR DESPAIR NO_ANSWER DESPAIR DESPAIR DESPAIR ... DESPAIR MEH DESPAIR DESPAIR DESPAIR NO_ANSWER DESPAIR 165 1017 -852
2 Q6 | Any full-sized candy bar JOY NO_ANSWER JOY JOY JOY NO_ANSWER JOY JOY JOY ... JOY JOY MEH JOY JOY NO_ANSWER JOY 1477 15 1462
3 Q6 | Black Jacks MEH NO_ANSWER MEH DESPAIR NO_ANSWER NO_ANSWER DESPAIR MEH MEH ... DESPAIR DESPAIR DESPAIR NO_ANSWER DESPAIR NO_ANSWER DESPAIR 87 754 -667
4 Q6 | Bonkers (the candy) DESPAIR NO_ANSWER MEH MEH NO_ANSWER NO_ANSWER DESPAIR MEH MEH ... MEH MEH DESPAIR NO_ANSWER MEH NO_ANSWER DESPAIR 109 467 -358
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
98 Q6 | Vicodin DESPAIR NO_ANSWER JOY JOY DESPAIR NO_ANSWER MEH DESPAIR DESPAIR ... DESPAIR MEH JOY NO_ANSWER JOY NO_ANSWER JOY 674 679 -5
99 Q6 | Whatchamacallit Bars DESPAIR NO_ANSWER JOY JOY JOY NO_ANSWER DESPAIR MEH MEH ... NO_ANSWER JOY DESPAIR JOY MEH NO_ANSWER DESPAIR 802 273 529
100 Q6 | White Bread DESPAIR NO_ANSWER DESPAIR DESPAIR DESPAIR NO_ANSWER DESPAIR DESPAIR DESPAIR ... DESPAIR MEH MEH DESPAIR DESPAIR NO_ANSWER MEH 43 1376 -1333
101 Q6 | Whole Wheat anything DESPAIR NO_ANSWER DESPAIR DESPAIR DESPAIR NO_ANSWER DESPAIR DESPAIR DESPAIR ... DESPAIR MEH DESPAIR MEH DESPAIR NO_ANSWER DESPAIR 110 1233 -1123
102 Q6 | York Peppermint Patties DESPAIR NO_ANSWER DESPAIR JOY JOY NO_ANSWER DESPAIR DESPAIR MEH ... JOY JOY MEH JOY MEH NO_ANSWER JOY 1040 222 818

103 rows × 2309 columns

In [120]:
## net_feelies sum check

# np.random.seed(56)

# count = np.random.uniform(low = -57, high = 9, size = (len(candy_reduced_transpose))).round(0)

# net_feelies = count
# net_feelies.sum()

Exercise_AT: Select only the joy_count, despair_count, and net_feelies columns from the candy_reduced_transpose DataFrame. Sort this DataFrame in descending order by net_feelies and save this as candy_net_sorted.

In [121]:
### This needs to be corrected! ###

candy_net_sorted = candy_reduced_transpose[['joy_count', 'despair_count', 'net_feelies']]#.sort_values(by='net_feelies', ascending=False)                          
candy_net_sorted.sort_values(by='net_feelies', ascending=False) 
Out[121]:
joy_count despair_count net_feelies
2 1477 15 1462
75 1416 88 1328
44 1367 47 1320
14 1363 63 1300
96 1339 67 1272
... ... ... ...
20 84 1356 -1272
74 86 1398 -1312
100 43 1376 -1333
28 43 1386 -1343
8 24 1535 -1511

103 rows × 3 columns

In [122]:
###### net_feelies sum check ######
## This need correction. Answer should be 5602, not -2523 ##

candy_net_sorted['net_feelies'].sum()      
Out[122]:
5602

Be prepared to answer what candy/item had the most and least net_feelies values.

Encoding DataFrame¶

We now want to get the candy DataFrame ready to run a machine learning algorthim to determine if we could predict a person's gender based on what candy they prefer.

You will learn more about this in the machine learning classes, but some algorithms work exclusively with numeric values. We will now turn all of our values into numeric values. There are easier ways to do this with sklearn, which you will study in later courses, but we will use Pandas to perform these exercises for further practice.

Exercise_AU: For grading purposes, we want to leave the candy DataFrame as is. Make a copy of the candy DataFrame and save this new DataFrame as candy_encode.

In [123]:
candy_encode = candy.copy()
candy_encode.shape
Out[123]:
(2305, 113)

Exercise_AV: For the candy_encode DataFrame, replace any Female values with 0 and any Male values with 1.

In [124]:
# For the candy_encode DataFrame, replace any Female values with 0 and any Male values with 1

candy_encode['Q2: GENDER'].replace(['Female','Male'],[0,1], inplace=True)

candy_encode
Out[124]:
Q1: GOING OUT? Q2: GENDER Q3: AGE Q4: COUNTRY Q6 | 100 Grand Bar Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) Q6 | Any full-sized candy bar Q6 | Black Jacks Q6 | Bonkers (the candy) Q6 | Bonkers (the board game) ... Q6 | Whatchamacallit Bars Q6 | White Bread Q6 | Whole Wheat anything Q6 | York Peppermint Patties Q10: DRESS Q11: DAY Q12: MEDIA [Daily Dish] Q12: MEDIA [Science] Q12: MEDIA [ESPN] Q12: MEDIA [Yahoo]
0 No 1 36-45 USA MEH DESPAIR JOY MEH DESPAIR DESPAIR ... DESPAIR DESPAIR DESPAIR DESPAIR White and gold Sunday 0.0 1.0 0.0 0.0
1 No 1 46-55 USA NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER ... NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER 0.0 0.0 0.0 0.0
2 No 1 36-45 USA MEH DESPAIR JOY MEH MEH DESPAIR ... JOY DESPAIR DESPAIR DESPAIR White and gold Sunday 0.0 1.0 0.0 0.0
3 No 1 18-25 USA JOY DESPAIR JOY DESPAIR MEH DESPAIR ... JOY DESPAIR DESPAIR JOY White and gold Friday 0.0 1.0 0.0 0.0
4 No 1 unknown Other JOY DESPAIR JOY NO_ANSWER NO_ANSWER NO_ANSWER ... JOY DESPAIR DESPAIR JOY NO_ANSWER NO_ANSWER 0.0 1.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2300 No 1 18-25 USA JOY DESPAIR MEH DESPAIR DESPAIR MEH ... DESPAIR MEH DESPAIR MEH White and gold Friday 0.0 0.0 0.0 0.0
2301 No 0 26-35 USA MEH DESPAIR JOY NO_ANSWER NO_ANSWER NO_ANSWER ... JOY DESPAIR MEH JOY Blue and black Friday 0.0 1.0 0.0 0.0
2302 No 0 26-35 USA MEH DESPAIR JOY DESPAIR MEH JOY ... MEH DESPAIR DESPAIR MEH Blue and black Friday 0.0 1.0 0.0 0.0
2303 No 1 56+ USA NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER ... NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER 0.0 0.0 0.0 0.0
2304 No 0 56+ USA DESPAIR DESPAIR JOY DESPAIR DESPAIR DESPAIR ... DESPAIR MEH DESPAIR JOY White and gold Sunday 1.0 0.0 0.0 0.0

2305 rows × 113 columns

Exercise_AW: Again, you will learn more about this later, but we need to separate the column that we want to predict (called the response) and the columns that we will use to make the predictions (called the features). For both of the items below, make sure that the index is reset and goes from 0 to n-1.

  • Select only the Q2: GENDER column from candy_encode and save this as candy_response. Note: This should be a Series.
  • Drop the following columns from the candy_encode DataFrame: Q2: GENDER,Q1: GOING OUT?,Q3: AGE,Q4: COUNTRY,Q10: DRESS,Q11: DAY, Q12: MEDIA [Daily Dish],Q12: MEDIA [Science],Q12: MEDIA [ESPN],Q12: MEDIA [Yahoo]. Save the remaining columns as candy_features.
In [125]:
# Select only the Q2: GENDER column from candy_encode and save this as candy_response(This should be a Series)

candy_response = candy_encode['Q2: GENDER']

candy_response
Out[125]:
0       1
1       1
2       1
3       1
4       1
       ..
2300    1
2301    0
2302    0
2303    1
2304    0
Name: Q2: GENDER, Length: 2305, dtype: int64
In [126]:
### Drop the following columns from the candy_encode DataFrame: 
    # Q1: GOING OUT?,Q2: GENDER, Q3: AGE,Q4: COUNTRY,Q10: DRESS,Q11: DAY, Q12: MEDIA [Daily Dish],
    # Q12: MEDIA [Science],Q12: MEDIA [ESPN],Q12: MEDIA [Yahoo].
    
# Save the remaining columns as candy_features
    
    
candy_features = candy_encode.drop(columns = ['Q1: GOING OUT?','Q2: GENDER', 'Q3: AGE', 'Q4: COUNTRY', 
                                              'Q10: DRESS', 'Q11: DAY', 'Q12: MEDIA [Daily Dish]', 
                                              'Q12: MEDIA [Science]', 'Q12: MEDIA [ESPN]', 
                                              'Q12: MEDIA [Yahoo]'], axis =1)

candy_features
Out[126]:
Q6 | 100 Grand Bar Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes) Q6 | Any full-sized candy bar Q6 | Black Jacks Q6 | Bonkers (the candy) Q6 | Bonkers (the board game) Q6 | Bottle Caps Q6 | Box'o'Raisins Q6 | Broken glow stick Q6 | Butterfinger ... Q6 | Three Musketeers Q6 | Tolberone something or other Q6 | Trail Mix Q6 | Twix Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein Q6 | Vicodin Q6 | Whatchamacallit Bars Q6 | White Bread Q6 | Whole Wheat anything Q6 | York Peppermint Patties
0 MEH DESPAIR JOY MEH DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR ... JOY JOY DESPAIR JOY DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR
1 NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER ... NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER
2 MEH DESPAIR JOY MEH MEH DESPAIR MEH DESPAIR DESPAIR MEH ... DESPAIR JOY MEH JOY DESPAIR JOY JOY DESPAIR DESPAIR DESPAIR
3 JOY DESPAIR JOY DESPAIR MEH DESPAIR MEH DESPAIR DESPAIR MEH ... JOY JOY DESPAIR JOY MEH JOY JOY DESPAIR DESPAIR JOY
4 JOY DESPAIR JOY NO_ANSWER NO_ANSWER NO_ANSWER MEH MEH DESPAIR JOY ... JOY JOY MEH JOY DESPAIR DESPAIR JOY DESPAIR DESPAIR JOY
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2300 JOY DESPAIR MEH DESPAIR DESPAIR MEH MEH DESPAIR DESPAIR MEH ... MEH MEH JOY JOY MEH JOY DESPAIR MEH DESPAIR MEH
2301 MEH DESPAIR JOY NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER DESPAIR DESPAIR JOY ... MEH MEH DESPAIR JOY NO_ANSWER NO_ANSWER JOY DESPAIR MEH JOY
2302 MEH DESPAIR JOY DESPAIR MEH JOY DESPAIR MEH MEH DESPAIR ... JOY JOY MEH MEH MEH JOY MEH DESPAIR DESPAIR MEH
2303 NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER ... NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER NO_ANSWER
2304 DESPAIR DESPAIR JOY DESPAIR DESPAIR DESPAIR DESPAIR MEH DESPAIR JOY ... JOY MEH DESPAIR DESPAIR DESPAIR JOY DESPAIR MEH DESPAIR JOY

2305 rows × 103 columns

Exercise_AX: Use Panda's get_dummies() to encode the candy_features data, making sure to set drop_first=True. Save this as candy_features_encoded.

In [127]:
# Use Panda's get_dummies() to encode the candy_features data, making sure to set drop_first=True. 
# Save this as candy_features_encoded

candy_features_encoded = pd.get_dummies((candy_features), drop_first=True) 

candy_features_encoded
Out[127]:
Q6 | 100 Grand Bar_JOY Q6 | 100 Grand Bar_MEH Q6 | 100 Grand Bar_NO_ANSWER Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes)_JOY Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes)_MEH Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes)_NO_ANSWER Q6 | Any full-sized candy bar_JOY Q6 | Any full-sized candy bar_MEH Q6 | Any full-sized candy bar_NO_ANSWER Q6 | Black Jacks_JOY ... Q6 | Whatchamacallit Bars_NO_ANSWER Q6 | White Bread_JOY Q6 | White Bread_MEH Q6 | White Bread_NO_ANSWER Q6 | Whole Wheat anything_JOY Q6 | Whole Wheat anything_MEH Q6 | Whole Wheat anything_NO_ANSWER Q6 | York Peppermint Patties_JOY Q6 | York Peppermint Patties_MEH Q6 | York Peppermint Patties_NO_ANSWER
0 0 1 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 1 0 0 1 0 ... 1 0 0 1 0 0 1 0 0 1
2 0 1 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 1 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
4 1 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2300 1 0 0 0 0 0 0 1 0 0 ... 0 0 1 0 0 0 0 0 1 0
2301 0 1 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 1 0 1 0 0
2302 0 1 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
2303 0 0 1 0 0 1 0 0 1 0 ... 1 0 0 1 0 0 1 0 0 1
2304 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 1 0 0

2305 rows × 309 columns

Code Check: Make sure that the candy, candy_features, candy_response and candy_features_encoded has an index that goes from 0 to n-1 or your final CodeGrade tests will not pass.

Final Analysis¶

Great work! You have now cleaned your data and prepared it to be passed to a machine learning model.

I created models using Random Forest, Logistic Regression, and XGBoost algorithms, and they all returned around 70% accuracy rates. However, the other accuracy metrics (that you will learn about more in the machine learning classes) didn't look as good. Given the metrics that were calculated, I would say that based only on this data, using candy preference is not that great of an indicator of someone's gender.

Next Steps: Make sure that your notebook passes all the CodeGrade tests and then use this notebook to answer questions in the corresponding quiz in Brightspace.